We give a general result concerning the rates of convergence of
penalized empirical risk minimizers (PERM) in the regression model.
Then, we consider the problem of agnostic learning of the regression,
and give in this context an oracle inequality and a lower bound for
PERM over a finite class. These results hold for a general multivariate
random design, the only assumption being the compactness of
the support of its law (allowing discrete distributions for instance).
Then, using these results, we construct adaptive estimators. We consider
as examples adaptive estimation over anisotropic Besov spaces
or reproducing kernel Hilbert spaces. Finally, we provide empirical
evidence that aggregation leads to more stable estimators than
the more standard cross-validation or generalized cross-validation
methods for the selection of the smoothing parameter, when the number
of observations is small.



1. Introduction. 

1.1. Motivations. In this paper, we explore some statistical properties of
penalized empirical risk minimization (PERM) and aggregation procedures
in the regression model. From these properties, we will be able to obtain
results concerning adaptive estimation for several problems. Given a data set
$D_n$, we consider two problems. Let us define the norm
$\|g\|^2 := \int g(x)^2 P_X(dx)$, where $P_X$ is the law of the covariates,
and let $E[\cdot]$ be the expectation w.r.t. the joint law of $D_n$. The first
problem is the problem of estimation of the regression function $f_0$.
Namely, we aim at constructing some procedure $\bar f_n$ satisfying

$E\|\bar f_n - f_0\|^2 \le \psi(n)$ (1.1)

where $\psi(n)$, called the rate of convergence, is a quantity that we want to
be small as n increases. To get this kind of inequality, it is well-known that one has
to assume that $f_0$ belongs to a set with a small complexity (cf., for instance,
the "No free Lunch theorem" in [11]). This is what we do in Section 2 below,
where an assumption on the complexity is considered, see Assumption $(C_\beta)$
on the metric entropy.

However, this kind of "a priori" may not be fulfilled. That is why the
second problem, called agnostic learning, has been introduced (cf. [17, 23]
and references therein). For this problem, one is given a set F of functions.
Without any assumption on $f_0$, we want to construct (from the data) a
procedure $\hat f$ which has a risk as close as possible to the smallest risk over
F. Namely, we want to obtain oracle inequalities, that is inequalities of the
form

$E\|\hat f - f_0\|^2 \le C \min_{f \in F} \|f - f_0\|^2 + \phi(n, F)$,

where $C \ge 1$ and $\phi(n, F)$ is called the residue, which is the quantity that we
want to be small as n increases. When F is of finite cardinality M, the agnostic
problem is called the aggregation problem and the residue $\phi(n, F) = \phi(n, M)$
is called the rate of aggregation. The main difference between the problems of
estimation and aggregation is that we do not need any assumption on $f_0$ for
the second problem. Nevertheless, aggregation methods have been widely
used to construct adaptive procedures for the estimation problem. That is
the reason why we study aggregation procedures in Section 3 below. We will
use these procedures in Section 4 to construct adaptive estimators in several
particular cases, such as adaptive estimation in reproducing kernel Hilbert
spaces (RKHS) or adaptive estimation over anisotropic Besov spaces.

In Section 3, we also prove that the "natural" aggregation procedure,
namely empirical risk minimization (ERM) (or its penalized version), fails
to achieve the optimal rate of aggregation in this setup. This result motivates
the use of an aggregation procedure instead of the more common ERM.
Moreover, we provide empirical evidence in Section 5 that aggregation
(with jackknife) is more stable than the classical cross-validation or
generalized cross-validation procedures when the number of observations and the
signal-to-noise ratio are small.

The approach proposed in this paper makes it possible to give rates of
convergence for adaptive estimators over very general function sets, such as
anisotropic Besov spaces, with a very mild assumption on the law of the
covariates: all the results are stated with the sole assumption that the law of
the covariates has a compact support.

1.2. The model. Let $(X, Y), (X_1, Y_1), \ldots, (X_n, Y_n)$ be independent and
identically distributed variables in $\mathbb{R}^d \times \mathbb{R}$. We consider the regression model

$Y = f_0(X) + \sigma \varepsilon$, (1.2)

where $f_0 : \mathbb{R}^d \to \mathbb{R}$ and $\varepsilon$ is called noise. To simplify, we assume that the
noise level $\sigma$ is known. We denote by $P$ the probability distribution of $(X, Y)$
and by $P_X$ the marginal distribution of $X$, also called the design or covariates
distribution. We denote by $P^n$ the joint distribution of the sample

$D_n := [(X_i, Y_i) : 1 \le i \le n]$,

and by $P_n := P^n[\cdot \mid X^n]$, where $X^n := (X_1, \ldots, X_n)$, the joint distribution of
the sample $D_n$ conditional on the design $X^n$. The expectation
w.r.t. $P_n$ is denoted by $E_n$. The noise $\varepsilon$ is symmetrical and subgaussian
conditionally on $X$. Namely, we assume that there is $b_\varepsilon > 0$ such that

$(G1)(b_\varepsilon)$ : $E[\exp(t\varepsilon) \mid X] \le \exp(b_\varepsilon^2 t^2 / 2)$ $\forall t > 0$ (1.3)

which is equivalent (up to an appropriate choice for the constant $b_\varepsilon$) to

$(G2)(b_\varepsilon)$ : $P[\varepsilon > t \mid X] \le \exp(-t^2 / (2 b_\varepsilon^2))$ $\forall t > 0$.

Assumption (1.3) is standard in nonparametric regression; it includes the
models of bounded and Gaussian regression. An important fact, that will
be used in the proofs, is that for $\varepsilon_1, \ldots, \varepsilon_n$ independent and such that $\varepsilon_i$
satisfies $(G1)(b_i)$ for any $i = 1, \ldots, n$, the random variable $\sum_{i=1}^n a_i \varepsilon_i$ satisfies
$(G1)((\sum_i a_i^2 b_i^2)^{1/2})$ for any $a_1, \ldots, a_n \in \mathbb{R}$, and thus the concentration property
$(G2)((2 \sum_i a_i^2 b_i^2)^{1/2})$. Other equivalent definitions of subgaussianity are, when
$\varepsilon$ is symmetrical, to assume that $E[\exp(\varepsilon^2 / b_\varepsilon^2) \mid X] \le 2$ for some $b_\varepsilon > 0$, or
$(E[|\varepsilon|^p \mid X])^{1/p} \le b_\varepsilon \sqrt{p}$ for any $p \ge 1$.

Concerning the design, we only assume that X has a compact support,
and without loss of generality we can take its support equal to $[0, 1]^d$. In
particular, we do not need $P_X$ to be continuous with respect to the
Lebesgue measure. Note that the problem of adaptive estimation with such
a general multivariate design is not common in the literature. In the so-called
"distribution free nonparametric estimation" framework, when we want to
obtain convergence rates and not only the consistency of the estimators, it
is, as far as we know, always assumed that $|Y| \le L$ a.s. for some constant
$L > 0$, see for instance [15], [30], [31], [29] and [27], which is a setting less
general than the one considered here.

Remark. The results presented here can be extended to subexponential
noise, that is when $E[\exp(|\varepsilon| / b_\varepsilon) \mid X] \le 2$ for some $b_\varepsilon > 0$, but it involves
complications (chaining with an adaptive truncation argument in the proof
of Theorem 1 below, see for instance [7] or [44], among others) that we prefer
to skip here.

2. PERM over a large function set. We consider the following problem
of estimation: we fix a function space $\mathcal F$ and we want to recover $f_0$
based on the sample $D_n$ using the knowledge that $f_0 \in \mathcal F$. The set $\mathcal F$ is
endowed with a seminorm $|\cdot|_{\mathcal F}$. To fix the ideas, when d = 1, one can
think for instance of the Sobolev space $\mathcal F = W_2^s$ of functions such that
$|f|_{\mathcal F}^2 = \int f^{(s)}(t)^2 dt < +\infty$, where s is a natural integer and $f^{(s)}$ is the s-th
derivative of f. In this case, the estimator described below is the so-called
smoothing spline estimator, see for instance [46]. Several other examples are
given in Section 4 below.

2.1. Definition of the PERM. The idea of penalized empirical risk
minimization is to balance the goodness-of-fit of the estimator
to the data against its smoothness. The quantity $|f|_{\mathcal F}$ measures the smoothness
(or "roughness") of $f \in \mathcal F$ and the balance is quantified by a parameter
$h > 0$.

Definition 1 (PERM). Let $\lambda = (h, \mathcal F)$ be fixed. We say that $\bar f_\lambda$ is a
penalized empirical risk minimizer if it minimizes

$R_n(f) + \mathrm{pen}_\lambda(f)$ (2.1)

over $\mathcal F$, where $\mathrm{pen}_\lambda(f) := h^2 |f|_{\mathcal F}^\alpha$ for some $\alpha > 0$ and where

$R_n(f) := \|Y - f\|_n^2 = \frac{1}{n} \sum_{i=1}^n (Y_i - f(X_i))^2$

is the empirical risk of f over the sample $D_n$.

The parameter $\alpha$ is a tuning parameter, which can be chosen depending
on the seminorm $|\cdot|_{\mathcal F}$, see the examples in Section 4. For simplicity, we shall
always assume that a PERM $\bar f_\lambda$ exists, since we can always find $\bar f_\lambda$ such that
$R_n(\bar f_\lambda) + \mathrm{pen}_\lambda(\bar f_\lambda) \le \inf_{f \in \mathcal F} \{R_n(f) + \mathrm{pen}_\lambda(f)\} + 1/n$, which satisfies the same
upper bound from Theorem 2 (see below) as a hypothetical exact minimizer. A
minimizer is not necessarily unique, but this is not a problem for the
theoretical results proposed below. PERM has been studied in a tremendous
number of papers; we only refer to [43, 44], [36] and [15], which are the closest
to the material proposed in this section.

In Theorem 2 below we propose a general upper bound for PERM over
a space $\mathcal F$ that satisfies the complexity Assumption $(C_\beta)$ below. The proof
of this upper bound involves a result concerning the supremum of the
empirical process $Z_n(f) := \sigma n^{-1/2} \sum_{i=1}^n f(X_i) \varepsilon_i$ over $f \in \mathcal F$, which is given in
Theorem 1 below.

2.2. Some definitions and useful tools. Let $(E, \|\cdot\|)$ be a normed space.
For $z \in E$, we denote by $B(z, \delta)$ the ball centered at z with radius $\delta$. We
say that $\{z_1, \ldots, z_p\}$ is a $\delta$-cover of some set $A \subset E$ if

$A \subset \bigcup_{1 \le i \le p} B(z_i, \delta)$.

The $\delta$-covering number $N(\delta, A, \|\cdot\|)$ is the minimal size of a $\delta$-cover of A
and

$H(\delta, A, \|\cdot\|) := \log N(\delta, A, \|\cdot\|)$

is the $\delta$-entropy of A. The main assumption in this section concerns the
complexity of the space $\mathcal F$, which is quantified by a bound on the entropy
of its unit ball $B_{\mathcal F} := \{f \in \mathcal F : |f|_{\mathcal F} \le 1\}$. We denote for short $H_\infty(\delta, A) =
H(\delta, A, \|\cdot\|_\infty)$ where $\|f\|_\infty := \sup_{x \in [0,1]^d} |f(x)|$. We denote by $C([0,1]^d)$ the
set of continuous functions on $[0, 1]^d$.
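
To make the notion of a cover concrete, here is a minimal Python sketch. It builds a greedy cover of a finite family of functions sampled on a grid, in the sup norm. The family (linear functions with slopes in [0, 1]) and the grid are assumptions chosen for illustration, and the greedy construction only gives an upper bound on the covering number, not the minimal cover used in the definition.

```python
import math

def sup_dist(u, v):
    """Sup-norm distance between two functions sampled on the same grid."""
    return max(abs(a - b) for a, b in zip(u, v))

def greedy_cover(points, delta):
    """Return centers such that every point is within delta of some center.

    A point is promoted to a center only if it is farther than delta from
    all current centers, so the result is a valid delta-cover. Its size is
    an upper bound on the delta-covering number, not necessarily the minimum.
    """
    centers = []
    for p in points:
        if all(sup_dist(p, c) > delta for c in centers):
            centers.append(p)
    return centers

# Hypothetical class: f_a(x) = a * x on a grid of [0, 1], slopes a in [0, 1].
grid = [i / 50 for i in range(51)]
family = [[a / 200 * x for x in grid] for a in range(201)]

for delta in (0.2, 0.1, 0.05):
    size = len(greedy_cover(family, delta))
    # log(size) plays the role of the delta-entropy of this toy family.
    print(delta, size, math.log(size))
```

For this one-parameter family the cover size grows roughly like 1/delta, so the entropy grows only logarithmically, which is far smaller than the polynomial growth allowed for the nonparametric classes considered in this section.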

Assumption $(C_\beta)$. We assume that $\mathcal F \subset C([0, 1]^d)$ and that there is a
number $\beta \in (0, 2)$ such that for any $\delta > 0$, we have

$H_\infty(\delta, B_{\mathcal F}) \le D \delta^{-\beta}$ (2.2)

where $D > 0$ is independent of $\delta$.

This assumption entails that, for any radius $R > 0$, we have

$H_\infty(\delta, B_{\mathcal F}(R)) \le D (R / \delta)^{\beta}$,

where $B_{\mathcal F}(R) := \{f \in \mathcal F : |f|_{\mathcal F} \le R\}$. Assumption $(C_\beta)$ is satisfied by nearly
all smoothness spaces considered in the nonparametric literature (at least when
the smoothness of the space is large enough compared to the dimension, see
below). The most general space that we consider in this paper and which
satisfies $(C_\beta)$ is the anisotropic Besov space $B^{s}_{p,q}$, where $s = (s_1, \ldots, s_d)$ is a
vector of positive numbers. This space is precisely defined in Appendix A.
Each $s_i$ corresponds to the smoothness in the direction $e_i$, where $\{e_1, \ldots, e_d\}$
is the canonical basis of $\mathbb{R}^d$. The computation of the entropy of $B^{s}_{p,q}$ can be
found in [39]; we give more details in Appendix A. If $\bar s$ is the harmonic mean
of s, namely

$\frac{1}{\bar s} := \frac{1}{d} \sum_{i=1}^d \frac{1}{s_i}$, (2.3)

then $B^{s}_{p,q}$ satisfies $(C_\beta)$ with $\beta = d / \bar s$, given that $\bar s > d / p$, which is the usual
condition to have the embedding $B^{s}_{p,q} \subset C([0, 1]^d)$.
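
As a small numerical illustration of how the smoothness vector translates into a complexity exponent, the following Python sketch computes the harmonic mean from (2.3), the entropy exponent beta = d / s_bar, and the exponent 2 / (2 + beta) of the rate that appears in Theorem 2. The function names and the example vectors are assumptions introduced here for illustration.

```python
def harmonic_mean(s):
    """Harmonic mean s_bar of the smoothness vector, 1/s_bar = (1/d) sum 1/s_i."""
    d = len(s)
    return d / sum(1.0 / s_i for s_i in s)

def entropy_exponent(s):
    """Exponent beta = d / s_bar of the entropy bound for smoothness vector s."""
    return len(s) / harmonic_mean(s)

def rate_exponent(beta):
    """Exponent r such that the rate reads n ** (-r), with r = 2 / (2 + beta)."""
    if not 0.0 < beta < 2.0:
        raise ValueError("the entropy exponent must lie in (0, 2)")
    return 2.0 / (2.0 + beta)

# Isotropic example in dimension 2 with smoothness 2 in both directions.
beta_iso = entropy_exponent([2.0, 2.0])    # d / s = 1
# Anisotropic example: smoother in the second direction.
beta_aniso = entropy_exponent([2.0, 4.0])  # 1/2 + 1/4 = 0.75

print(beta_iso, rate_exponent(beta_iso))
print(beta_aniso, rate_exponent(beta_aniso))
```

Note that beta is simply the sum of the inverse smoothnesses, so a single rough direction is enough to push beta up and slow the rate down.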

Remark. Under the restriction $\beta \in (0, 2)$, Dudley's entropy integral
satisfies

$\int_0^{\mathrm{diam}(B_{\mathcal F}, \|\cdot\|_\infty)} \sqrt{H_\infty(\delta, B_{\mathcal F})} \, d\delta < \infty$,

where $\mathrm{diam}(B_{\mathcal F}, \|\cdot\|_\infty)$ is the $L_\infty$-diameter of $B_{\mathcal F}$. This is a standard
assumption coming from empirical process theory. It is related to the so-called
chaining argument, that we use in the proof of Theorem 1. However, in
order to consider a larger space of functions $\mathcal F$, we could think of function
spaces with a complexity $\beta > 2$. In this case, using a slightly different
chaining argument (cf. [45]), the quantity appearing in the upper bound of some
subgaussian process is an entropy integral of the type
$\int_{c}^{\mathrm{diam}(B_{\mathcal F}, \|\cdot\|_\infty)} \sqrt{H_\infty(\delta, B_{\mathcal F})} \, d\delta$ with a positive lower limit c, which
converges whatever $\beta$ is. However, such considerations are beyond the scope of
the paper and are left for future work.

2.3. About the supremum of the process $Z_n(\cdot)$. The beginning of the proof
of Theorem 2 is, as usual with the proof of upper bounds for M-estimators,
based on an inequality that links the empirical norm of estimation and the
empirical process of the model. This idea goes back to the key papers [42] and
[4], see also [43, 44] and [36] for a detailed presentation. In regression, it
reads, if $\bar f$ is a PERM and if $f_0 \in \mathcal F$:

$\|\bar f - f_0\|_n^2 + \mathrm{pen}(\bar f) \le \frac{2}{\sqrt{n}} Z_n(\bar f - f_0) + \mathrm{pen}(f_0)$,

where

$Z_n(f) := \frac{\sigma}{\sqrt{n}} \sum_{i=1}^n f(X_i) \varepsilon_i$. (2.4)

This inequality explains why the next Theorem 1 is the main ingredient
of the proof of Theorem 2 below. Then, an important remark is that (1.3)
entails

$P_n[Z_n(f) > z] \le \exp\Big(-\frac{z^2}{2 b^2 \|f\|_n^2}\Big)$ (2.5)

for any fixed f, $z > 0$ and $n \ge 1$, where $\|f\|_n^2 := n^{-1} \sum_{i=1}^n f(X_i)^2$ and where
we take for short $b := \sigma b_\varepsilon$. This deviation inequality is at the core of the
proof of Theorem 1 below. Let us introduce the empirical ball $B_n(f_0, \delta) :=
\{f : \|f - f_0\|_n \le \delta\}$ and let us recall that $P_n := P^n[\cdot \mid X^n]$ is the joint law of
the sample $D_n$ conditionally on the design $X^n = (X_1, \ldots, X_n)$.

Theorem 1. Let $Z_n(\cdot)$ be the empirical process (2.4) and assume that
$(\mathcal F, |\cdot|_{\mathcal F})$ satisfies $(C_\beta)$. Then, if $f_0 \in \mathcal F$, we can find constants $z_1 > 0$ and



AGGREGATION OF PENALIZED EMPIRICAL RISK MINIMIZERS 






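$D_1 > 0$ such that:

$P_n\Big[ \sup_{f \in \mathcal F \cap B_n(f_0, \delta)} \frac{Z_n(f - f_0)}{\|f - f_0\|_n^{1 - \beta/2} (1 + |f|_{\mathcal F})^{\beta/2}} \ge z \Big] \le \exp(-D_1 z^2 \delta^{-\beta})$ (2.6)

for any $\delta > 0$ and $z \ge z_1$ (we recall that $\beta \in (0, 2)$).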

The proof of this theorem is given in Section 6; it uses techniques from
empirical process theory such as peeling and chaining. It is a uniform version
of (2.5), localized around $f_0$ (for the empirical norm). In this theorem, we
use the "weighting trick" that was introduced in [42, 44]: we divide $Z_n(\cdot)$ by
powers of $\|f - f_0\|_n$ and $|f|_{\mathcal F}$ in order to counterbalance, respectively, the variance of $Z_n(\cdot)$
and the massiveness of the class $\mathcal F$. This renormalization of the empirical
process is also at the core of the proof of Theorem 2.

2.4. Upper bound for the PERM. Theorem 2 below provides an upper
bound for the mean integrated squared error (MISE) of the PERM, both
for integration w.r.t. the empirical norm $\|f\|_n^2 = n^{-1} \sum_{i=1}^n f(X_i)^2$ and the
norm $\|f\|^2 := \int f(x)^2 P_X(dx)$.

Theorem 2. Let $\mathcal F$ be a space of functions satisfying $(C_\beta)$. Let $\lambda =
(h, \mathcal F)$ and $\bar f_\lambda$ be a PERM given by (2.1), where h satisfies

$h = a n^{-1/(2+\beta)}$ (2.7)

for some constant $a > 0$ and where $\alpha > 2\beta / (\beta + 2)$. If $f_0 \in \mathcal F$, we have:

$E^n \|\bar f_\lambda - f_0\|_n^2 \le C_1 (1 + |f_0|_{\mathcal F}^\alpha) n^{-2/(2+\beta)}$

for n large enough, where $C_1$ is a fixed constant depending on a, $\beta$, $\alpha$ and
b. If we assume further that $\|\bar f_\lambda - f_0\|_\infty \le Q$ a.s. for some constant $Q > 0$,
we have

$E^n \|\bar f_\lambda - f_0\|^2 \le C_2 (1 + |f_0|_{\mathcal F}^\alpha) n^{-2/(2+\beta)}$

for n large enough, where $C_2$ is a fixed constant depending on $C_1$ and Q.

Remark. Theorem 2 holds if we truncate $\bar f_\lambda$ by some constant Q such
that $\|f_0\|_\infty \le Q$. Such a truncation cannot be avoided in such a general
regression setting. Indeed, the PERM is, without truncation, in general not
consistent, see the example from Problem 20.4, p. 430 in [15].



Remark. Theorem 2 holds for any design law $P_X$, even for the degenerate
case where $P_X = \delta_x$ for some fixed point $x \in [0, 1]^d$, where $\delta_x$ is the
Dirac probability measure at x. Of course, in this case, the rate $n^{-2/(2+\beta)}$
becomes suboptimal, since the estimation problem with such a $P_X$ is no longer
"truly nonparametric". Indeed, for a discrete $P_X$ with finite support, it is
proved in [16] that the optimal rate is the parametric rate 1/n using a local
averaging estimator.

2.5. About the smoothing parameter h. It is well-known that in practice,
the choice of the parameter h is of first importance. From the theoretical
point of view, in order to make $\bar f_\lambda$ rate-optimal, h must be of the order of a
quantity involving the complexity of $\mathcal F$: see condition (2.7) on the bandwidth
and Assumption $(C_\beta)$. This problem is commonplace in nonparametric
statistics. Indeed, the role of the penalty in (2.1) is to balance
the massiveness of the space $\mathcal F$. Without this penalty, or if h is too
small, $\bar f_\lambda$ roughly interpolates the data, which is not suitable when the aim
is denoising (this phenomenon is called overfitting).

Of course, the complexity parameter $\beta$ is unknown to the statistician,
and even worse, it does not necessarily make sense in practice. So, several
procedures are proposed to select h based on the data. The most popular
are the leave-one-out cross-validation (CV) and the simpler generalized
cross-validation (GCV), which is often used with smoothing spline estimators
because of its computational simplicity, see [46] among others. Such methods
are known to provide good results in most cases. However, there are, as far
as we know, no convergence rate results for estimators based on CV or
GCV selection of smoothing parameters. In Section 4 below, we propose an
alternative approach. Instead of selecting one particular h, we mix
several estimators computed for different h in some grid using an aggregation
algorithm. This aggregation algorithm is described in Section 3. We show
that this approach makes it possible to construct adaptive estimators with optimal
rates of convergence in several particular cases, see Section 4. Moreover, we
show empirically in Section 5 that the aggregation approach is more stable
than CV or GCV when the number of observations is small.

3. PERM and aggregation over a finite set of functions. Let us
fix a set $F(\Lambda) := \{f_\lambda : \lambda \in \Lambda\}$ of arbitrary functions, and denote by $M = |\Lambda|$
its cardinality.

3.1. Suboptimality of PERM over a finite set. In this section, we prove
that minimizing the empirical risk $R_n(\cdot)$ (or a penalized version) on $F(\Lambda)$
is a suboptimal aggregation procedure in the sense of [41]. According to
[41], the optimal rate of aggregation in the Gaussian regression model is
$(\log M)/n$. This means that it is the minimum price one has to pay in order
to mimic the best function among a class of M functions with n observations.
This rate is achieved by the aggregate with cumulative exponential
weights, see [9] and [22]. In Theorem 3 below, we prove that the usual PERM
procedure cannot achieve this rate and thus, that it is suboptimal compared
to the aggregation methods with exponential weights. The lower bounds for
aggregation methods appearing in the literature (see [22, 32, 41]) are usually
based on minimax theory arguments. The one considered here is based
on geometric considerations, and involves an explicit example that makes
the PERM fail. For that, we consider the Gaussian regression model with
uniform design.

Assumption (G). Assume that $\varepsilon$ is standard Gaussian and that X is
univariate and uniformly distributed on [0, 1].

Theorem 3. Let $M \ge 2$ be an integer and assume that (G) holds. We
can find a regression function $f_0$ and a family $F(\Lambda)$ of cardinality M such
that, if one considers a penalization satisfying $|\mathrm{pen}(f)| \le C \sqrt{(\log M)/n}$, $\forall f \in
F(\Lambda)$, with $0 \le C < \sigma (24 \sqrt{2} c^*)^{-1}$ ($c^*$ is an absolute constant from the
Sudakov minorization, see Theorem 7 in Appendix B), the PERM procedure
defined by

$\tilde f_n \in \mathrm{argmin}_{f \in F(\Lambda)} (R_n(f) + \mathrm{pen}(f))$

satisfies

$E\|\tilde f_n - f_0\|^2 \ge \min_{f \in F(\Lambda)} \|f - f_0\|^2 + C_3 \sqrt{(\log M)/n}$

for any integer $n \ge 1$ and $M \ge M_0(\sigma)$ such that $n^{-1} \log[(M - 1)(M - 2)] \le
1/4$, where $C_3$ is an absolute constant.

This result shows that, in some particular cases, the PERM cannot mimic
the best element in a class of cardinality M faster than $((\log M)/n)^{1/2}$. This
rate is very far from the optimal one, $(\log M)/n$.

Let $F(\Lambda)$ be the set that we consider in the proof of Theorem 3 (see
Section 6 below), and take pen(f) = 0. Using Monte-Carlo (we do 5000
loops), we compute the excess risk $E\|\tilde f_n - f_0\|^2 - \min_{f \in F(\Lambda)} \|f - f_0\|^2$ of
the ERM. In Figure 1 below, we compare the excess risk and the bound
$((\log M)/n)^{1/2}$ for several values of M and n. It turns out that, for this set
$F(\Lambda)$, the lower bound $((\log M)/n)^{1/2}$ is indeed accurate for the excess risk.
Actually, by using the classical symmetrization argument and Dudley's
entropy integral, it is easy to obtain an upper bound for the excess risk of
the ERM of the order of $((\log M)/n)^{1/2}$ for any class $F(\Lambda)$ of cardinality M.

Fig 1. The excess risk of the ERM compared to $((\log M)/n)^{1/2}$ for several values of M
and n (x-axis)



3.2. Aggregation. For each $f_\lambda \in F(\Lambda)$, we compute a weight $\theta(f_\lambda) \in [0, 1]$
such that $\sum_{\lambda \in \Lambda} \theta(f_\lambda) = 1$. These weights give a level of significance to each
$f_\lambda \in F(\Lambda)$. The aggregated estimator is then the convex combination

$\hat f := \sum_{\lambda \in \Lambda} \theta(f_\lambda) f_\lambda$, (3.1)

where the weight of $f \in F(\Lambda)$ is given by

$\theta(f) := \frac{\exp(-n R_n(f) / T)}{\sum_{\lambda \in \Lambda} \exp(-n R_n(f_\lambda) / T)}$, (3.2)

where $T > 0$ is the so-called temperature parameter and where $R_n(f)$ is the
empirical risk of f. This aggregation algorithm (with "Gibbs" or "exponential"
weights) can also be found for instance in [9, 20, 21, 33, 35, 47, 48]. See
also [13] for adaptation by aggregation in a semiparametric model.

The next theorem is an oracle inequality for the aggregation method (3.2).
It will be useful to derive the adaptive upper bounds stated in Section 4
below.
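
The weights (3.2) and the convex combination (3.1) can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the three candidate functions, the toy data and the temperature value are hypothetical, and the shift by the minimal risk is a numerical-stability device that leaves the normalized weights unchanged.

```python
import math

def empirical_risk(f, xs, ys):
    """R_n(f): mean squared residual of f over the sample."""
    return sum((y - f(x)) ** 2 for x, y in zip(xs, ys)) / len(xs)

def gibbs_weights(risks, n, temperature):
    """Weights proportional to exp(-n * R_n(f) / T), normalized to sum to 1."""
    r_min = min(risks)  # shifting by the minimum leaves the weights unchanged
    unnormalized = [math.exp(-n * (r - r_min) / temperature) for r in risks]
    total = sum(unnormalized)
    return [u / total for u in unnormalized]

def aggregate(functions, weights):
    """Convex combination of the candidate functions, as in (3.1)."""
    return lambda x: sum(w * f(x) for w, f in zip(weights, functions))

# Hypothetical candidates and toy data, for illustration only.
candidates = [lambda x: 0.0, lambda x: x, lambda x: 2.0 * x]
xs = [0.1, 0.4, 0.5, 0.9]
ys = [0.12, 0.38, 0.55, 0.87]

risks = [empirical_risk(f, xs, ys) for f in candidates]
weights = gibbs_weights(risks, n=len(xs), temperature=0.05)
f_hat = aggregate(candidates, weights)
print(weights, f_hat(0.5))
```

With a very large temperature the weights approach the uniform law over the candidates, and with a very small one they concentrate on the empirical risk minimizer.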

Theorem 4. Assume that for any $f \in F(\Lambda)$, we have $\|f - f_0\|_\infty \le Q$
for some $Q > 0$. For any $a > 0$, the aggregation method (3.2) satisfies

$E^n \|\hat f - f_0\|^2 \le (1 + a) \min_{f \in F(\Lambda)} \|f - f_0\|^2 + (C + T) \frac{(\log n)^{1/2} \log M}{n}$,

where C is a constant depending on a, Q and $\sigma$.



When T is too large, the weights (3.2) are close to the uniform law over the
set of weak estimators, and of course, the resulting aggregate is inaccurate.
When T is too small, one weight is close to 1, and the others close to 0: in this
situation, the aggregate does nearly the same job as the ERM procedure.
This is not suitable since Theorem 3 told us that ERM is suboptimal. Hence,
T realizes a tradeoff between the ERM and the uniform weights procedure.
It can be simply chosen by minimization of the empirical risk. We know
empirically that it provides good results, see [13]. Namely, we select the
temperature

$\hat T := \mathrm{argmin}_{T \in \mathcal T} \sum_{i=1}^n (Y_i - \hat f^{(T)}(X_i))^2$, (3.3)

where $\hat f^{(T)}$ is the aggregated estimator (3.1) with temperature T and where
$\mathcal T$ is some set of temperatures. This is what we do in the empirical study
conducted in Section 5.
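
The selection rule (3.3) amounts to a one-dimensional grid search. The Python sketch below is a hedged illustration: the candidate functions, the data and the temperature grid are assumptions, and the helper names are introduced here and do not come from the paper.

```python
import math

def gibbs_weights(risks, n, temperature):
    """Exponential weights proportional to exp(-n * risk / T)."""
    r_min = min(risks)
    raw = [math.exp(-n * (r - r_min) / temperature) for r in risks]
    total = sum(raw)
    return [u / total for u in raw]

def aggregate_rss(candidates, risks, xs, ys, temperature):
    """Residual sum of squares of the aggregate built with temperature T."""
    w = gibbs_weights(risks, len(xs), temperature)
    return sum(
        (y - sum(wk * f(x) for wk, f in zip(w, candidates))) ** 2
        for x, y in zip(xs, ys)
    )

def select_temperature(candidates, xs, ys, temperatures):
    """Return the temperature of the grid minimizing the criterion in (3.3)."""
    n = len(xs)
    risks = [sum((y - f(x)) ** 2 for x, y in zip(xs, ys)) / n for f in candidates]
    scored = [(aggregate_rss(candidates, risks, xs, ys, t), t) for t in temperatures]
    best_rss, best_t = min(scored)
    return best_t, best_rss, risks

# Hypothetical candidates, data and grid, for illustration only.
candidates = [lambda x: 0.5 * x, lambda x: x, lambda x: 1.5 * x]
xs = [0.1, 0.3, 0.5, 0.7, 0.9]
ys = [0.15, 0.32, 0.61, 0.80, 1.05]
grid = [0.001, 0.01, 0.1, 1.0, 10.0]

t_hat, rss_hat, risks = select_temperature(candidates, xs, ys, grid)
print(t_hat, rss_hat)
```

Because the criterion is evaluated on the same sample that defines the weights, this rule is a practical heuristic rather than a procedure covered by Theorem 4.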

4. Examples of adaptive results. In this section, we construct adaptive
estimators for several regression problems using the tools from Sections 2
and 3. This involves, as usual with algorithms coming from statistical
learning theory, a split of the sample into two parts (an exception can be found
in [35]). The main steps of the construction of adaptive estimators given in
this section are:

1. split, at random, the whole sample $D_n$ into a training sample

$D_m := [(X_i, Y_i) : 1 \le i \le m]$,
where $m < n$, and a learning sample

$D_{(m)} := [(X_i, Y_i) : m + 1 \le i \le n]$;

2. choose a set $\Lambda$ of parameters and compute, using the training sample
$D_m$, the corresponding class $F(\Lambda) = \{\bar f_\lambda : \lambda \in \Lambda\}$ of PERM (see
Definition 1 in Section 2). Each $\Lambda$ depends on the considered problem
of adaptive estimation, see below;

3. using the learning sample $D_{(m)}$, compute the aggregation weights and
the aggregated estimator $\hat f_n$, respectively given by Equations (3.2)
and (3.1).

Then, using Theorem 2 (see Section 2) and Theorem 4 (see Section 3),
we will derive adaptive upper bounds for estimators constructed in this
way. Throughout the section, we shall assume the following.

Assumption (Split size). Let $\ell$ be the learning sample size, so that $\ell + m =
n$. We shall assume from now on, to simplify the presentation, that $\ell$ is a
fraction of n, typically n/2 or n/4.
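
Steps 1-3 can be sketched end to end in Python. The sketch relies on several assumptions made only for illustration: the weak estimators are PERM over the one-dimensional class of functions x -> c * x with penalty h^2 * c^2 (a hypothetical stand-in for the function spaces of this section), the data are simulated, and the temperature is fixed by hand.

```python
import math
import random

def fit_perm(train, h):
    """Toy PERM over {x -> c * x}: minimizes mean squared error + h^2 * c^2.

    Setting the derivative in c to zero gives the closed form below.
    """
    n = len(train)
    sxy = sum(x * y for x, y in train)
    sxx = sum(x * x for x, _ in train)
    c = sxy / (sxx + n * h * h)
    return lambda x, c=c: c * x

def split_and_aggregate(data, m, grid, temperature, rng):
    """Step 1: random split. Step 2: PERM on D_m. Step 3: weights on D_(m)."""
    shuffled = data[:]
    rng.shuffle(shuffled)
    train, learn = shuffled[:m], shuffled[m:]
    weak = [fit_perm(train, h) for h in grid]
    ell = len(learn)
    risks = [sum((y - f(x)) ** 2 for x, y in learn) / ell for f in weak]
    r_min = min(risks)
    raw = [math.exp(-ell * (r - r_min) / temperature) for r in risks]
    total = sum(raw)
    weights = [u / total for u in raw]
    f_hat = lambda x: sum(w * f(x) for w, f in zip(weights, weak))
    return f_hat, weights

# Simulated data (an assumption): Y = 1.5 * X + 0.1 * noise, X uniform on [0, 1].
rng = random.Random(0)
data = []
for _ in range(200):
    x = rng.random()
    data.append((x, 1.5 * x + 0.1 * rng.gauss(0.0, 1.0)))

grid = [0.0, 0.05, 0.1, 0.5, 1.0]  # hypothetical grid of smoothing parameters
f_hat, weights = split_and_aggregate(data, m=100, grid=grid, temperature=0.1, rng=rng)
print(weights, f_hat(1.0))
```

Here the learning sample has size n/2, which matches the split size assumption above. The empirical risks in Step 3 are computed on observations that were not used to fit the weak estimators.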

4.1. About the split, jackknife. The behavior of the aggregate $\hat f_n$ can
depend strongly on the split selected in Step 1, in particular when the number
of observations is small. Hence, a good strategy is to jackknife: repeat, say, J
times Steps 1-3 to obtain aggregates $\{\hat f_n^{(1)}, \ldots, \hat f_n^{(J)}\}$, and compute the mean:

$\bar f_n := \frac{1}{J} \sum_{j=1}^J \hat f_n^{(j)}$.

This jackknifed estimator provides better results than a single aggregate, see
Section 5 for an empirical study, where we also show that it gives more stable
estimators than the ones involving cross-validation or generalized
cross-validation. By convexity of $f \mapsto \|f - f_0\|^2$, the jackknifed estimator satisfies
the same upper bounds as a single aggregate: each of the adaptive upper
bounds stated below also holds when we use the jackknife.
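
The jackknife step is a plain average over repeated random splits. In the Python sketch below, `build_aggregate` is a hypothetical callable standing for Steps 2 and 3 (it receives a training and a learning sample and returns a function), and `toy_builder` is a deliberately simple stand-in used only to make the example runnable.

```python
import random

def jackknife(data, build_aggregate, m, n_repeats, rng):
    """Average the aggregates obtained from n_repeats independent random splits."""
    aggregates = []
    for _ in range(n_repeats):
        shuffled = data[:]
        rng.shuffle(shuffled)                      # Step 1: random split
        train, learn = shuffled[:m], shuffled[m:]
        aggregates.append(build_aggregate(train, learn))
    return lambda x: sum(f(x) for f in aggregates) / len(aggregates)

def toy_builder(train, learn):
    """Hypothetical stand-in for Steps 2-3: least squares fit of x -> c * x."""
    slope = sum(x * y for x, y in train) / sum(x * x for x, _ in train)
    return lambda x: slope * x

# Noise-free toy data Y = 2 * X, so every split recovers the same function.
data = [(i / 10, 2 * i / 10) for i in range(1, 21)]
f_bar = jackknife(data, toy_builder, m=10, n_repeats=5, rng=random.Random(1))
print(f_bar(0.5))
```

Averaging does not change the estimator in this noise-free example. With noisy data, it reduces the dependence of the result on any single split.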

For the set of weak estimators considered in this paper, the split of the
data is not a theoretical artefact. Indeed, if one skips Step 1 (that is, computes $F(\Lambda)$
and $\hat f_n$ using the whole sample $D_n$), then $\hat f_n$ has a very poor performance.
An empirical illustration of this phenomenon is given in Figure 2, where we
show the aggregation weights (3.2) when the data is split and when it is
not split. We consider a univariate design and cubic smoothing splines.
Namely, we compute the set $F(\Lambda)$ of PERM (see (2.1)) with $\mathcal F = \{f \in
L^2([0, 1]) : \int f^{(2)}(t)^2 dt < +\infty\}$ and penalty $\mathrm{pen}(f) = \int f^{(2)}(t)^2 dt$, where
$f^{(2)}$ stands for the second derivative of f. We do that for several smoothing
parameters h in a grid H, so that $\Lambda := \{(h, \mathcal F) : h \in H\}$. We used the
smooth.spline routine in the R software to compute $F(\Lambda)$. In Figure 2, the
x-axis is related to the value of h: it is the value of the parameter spar from
the smooth.spline routine. The vertical line is the value of spar selected
by cross-validation. The conclusion from Figure 2 is that, when the data is
not split, an overfitting phenomenon occurs: the aggregation algorithm
does not work, since it does not concentrate around a value of spar. Of
course, the resulting aggregated estimator has a very poor performance.

Fig 2. Aggregation weights with split (left) and without split (right) and smoothing
parameter obtained by cross-validation (vertical line)

4.2. How to derive the adaptive upper bounds. In all the examples
considered below, the scheme to derive adaptive upper bounds is as follows.
Say that $(\mathcal F_\beta : \beta \in B)$ is a set of embedded function classes ($\mathcal F_\beta \subset \mathcal F_{\beta'}$ if
$\beta < \beta'$) where each $\mathcal F_\beta$ satisfies Assumption $(C_\beta)$. Let $B_n$ be an appropriate
discretization of B. Let $\hat f_n$ be the aggregated estimator obtained using
Steps 1-3 (see the beginning of the section), with parameter $\Lambda = \Lambda_n =
\{(n^{-2/(2+\beta)}, \mathcal F_\beta) : \beta \in B_n\}$, and let $M_n$ be the cardinality of $F(\Lambda_n)$. Let
$E^m$ and $E^{(m)}$ be the expectations with respect to, respectively, the joint laws of
$D_m$ and $D_{(m)}$, so that, by independence, we have $E^n[\cdot] = E^m E^{(m)}[\cdot]$. Let
$f_0 \in \mathcal F_{\beta_0}$ for some $\beta_0 \in B$. Using Theorem 4, we have

$E^{(m)} \|\hat f_n - f_0\|^2 \le C \min_{f \in F(\Lambda_n)} \|f - f_0\|^2 + C \frac{(\log n)^{1/2} \log M_n}{n}$

$\le C \|\bar f_{\lambda_n} - f_0\|^2 + C \frac{(\log n)^{1/2} \log M_n}{n}$,

where $\lambda_n = (n^{-2/(2+\beta_n)}, \mathcal F_{\beta_n})$, with $\beta_n \in B_n$ chosen such that $\mathcal F_{\beta_0} \subset \mathcal F_{\beta_n}$
and $n^{-2/(2+\beta_n)} \le C_1 n^{-2/(2+\beta_0)}$. Then, integrating w.r.t. $E^m$ and using
Theorem 2, we have, if $M_n$ is no more than a power of n:

$E^n \|\hat f_n - f_0\|^2 \le C E^m \|\bar f_{\lambda_n} - f_0\|^2 + o(n^{-2/(2+\beta_0)})$

$\le C_2 n^{-2/(2+\beta_n)} + o(n^{-2/(2+\beta_0)}) \le C_3 n^{-2/(2+\beta_0)}$.

This proves that, if $f_0 \in \mathcal F_{\beta_0}$ for some $\beta_0 \in B$, we have $E^n \|\hat f_n - f_0\|^2 \le
C_3 n^{-2/(2+\beta_0)}$, thus $\hat f_n$ is indeed adaptive over $(\mathcal F_\beta : \beta \in B)$.

4.3. Sobolev spaces, spline estimators. When $\mathcal F$ is a Sobolev space, the
PERM (2.1) with $\alpha = 2$ is a very popular smoothing technique: see, among
others, [46] and [14]. The simplest example is when d = 1 and

$\mathcal F = W_2^s([0, 1]) := \{f \in L^2([0, 1]) : |f|^2_{W_2^s} := \int_0^1 f^{(s)}(t)^2 dt < \infty\}$,

where s is some natural integer and $f^{(s)}$ stands for the s-th derivative of f.
In this case, the PERM is called a smoothing spline, since in this situation
the unique minimizer of (2.1) is a spline, see for instance [46] or [15]. When
s = 2 (cubic splines), the routine smooth.spline from the R software (and
other software as well) computes the solution to (2.1) using the
B-spline basis, and chooses the parameter h via generalized cross-validation
(GCV).

The d-dimensional case is easily understood with the definition of $W_2^s([0, 1]^d)$
as the space of functions $f \in L^2([0, 1]^d)$ with all derivatives of total order s
in $L^2([0, 1]^d)$. Namely,

$W_2^s([0, 1]^d) := \{f \in L^2([0, 1]^d) : |f|_{W_2^s([0,1]^d)} < \infty\}$,

where

$|f|^2_{W_2^s([0,1]^d)} := \sum_{|k| = s} \frac{s!}{k!} \int_{[0,1]^d} (D^k f(x))^2 dx$, (4.1)

where for $k = (k_1, \ldots, k_d)$ we use the notations $k! := \prod_{i=1}^d k_i!$ and $|k| :=
\sum_{i=1}^d k_i$ and where $D^k$ is the differential operator $\partial^{|k|} / (\partial x_1^{k_1} \cdots \partial x_d^{k_d})$. When
$d > 1$, the PERM for the choice $\mathcal F = W_2^s([0, 1]^d)$ is called a thin plate spline,
see again for instance [46] or [15], where the practical computation of such
PERM is explained in detail. The usual assumption $s > d/2$ gives the
embedding $W_2^s([0, 1]^d) \subset C([0, 1]^d)$ and ensures that Assumption $(C_\beta)$ holds, see [6].
The situation where s is not an integer is a particular case of what we do
in Section 4.5 below. The case where $\mathcal F$ is a Sobolev space is actually a
particular case of both the next sections. Indeed, it is well known (see [46]
for instance) that a Sobolev space is a Reproducing Kernel Hilbert Space
(RKHS) for an appropriate kernel choice, and that it is also a Besov space
$B^{s}_{2,2}$.

4.4. Reproducing Kernel Hilbert Spaces. Reproducing Kernel Hilbert
Spaces (cf. [2]), RKHS for short, provide a unified context for regularization
in a wide variety of statistical models. Computational properties of estimators
obtained by minimization of a functional over an RKHS make these function
spaces very useful for statisticians. In this short section, we briefly recall some
definitions and computational properties of RKHS.

Let $\mathcal X$ be an abstract space (in this paper, we take $\mathcal X = [0, 1]^d$). We say
that $K : \mathcal X \times \mathcal X \to \mathbb{R}$ is a reproducing kernel, RK for short, if for any integer
p and any points $x_1, \ldots, x_p$ in $\mathcal X$, the matrix $(K(x_i, x_j))_{1 \le i,j \le p}$ is symmetric
positive definite. Let K be a RK. The Hilbert space associated with K, called
Reproducing Kernel Hilbert Space and denoted by $\mathcal H_K$, is the completion of
the space of all the finite linear combinations $\sum_j a_j K(x_j, \cdot)$ endowed with
the inner product $\langle \sum_j a_j K(x_j, \cdot), \sum_k b_k K(y_k, \cdot) \rangle_K = \sum_{j,k} a_j b_k K(x_j, y_k)$. We
denote by $|\cdot|_{\mathcal H_K}$ the associated norm on $\mathcal H_K$.

The representer theorem (see [28] for results on optimization in RKHS) is
at the heart of the minimization of functionals over RKHS. The solution of the
minimization problem

$\hat f \in \mathrm{argmin}_{f \in \mathcal H_K} \{R_n(f) + h^2 |f|^2_{\mathcal H_K}\}$ (4.2)

is the linear combination

$\hat f(\cdot) = \sum_{i=1}^n a_i K(X_i, \cdot)$, where $a = (a_i)_{1 \le i \le n} = (\mathbf{K} + n h^2 I_n)^{-1} Y$,

where $\mathbf{K}$ is the Gram matrix $(K(X_i, X_j))_{1 \le i,j \le n}$, where $Y = (Y_1, \ldots, Y_n)$
and where $I_n$ is the identity matrix in $\mathbb{R}^n$. There are many different ways to
simplify the computation of the coefficients a, see for instance [1].
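
The closed form above can be checked on a small example. The following Python sketch solves the linear system with a plain Gaussian elimination and uses a Gaussian kernel. Both choices, as well as the bandwidth, the value of h and the data, are assumptions made for illustration, and a numerical library would normally replace the hand-written solver.

```python
import math

def gaussian_kernel(x, y, bandwidth=0.2):
    """Gaussian kernel on the real line (a hypothetical choice of K)."""
    return math.exp(-((x - y) ** 2) / (2.0 * bandwidth ** 2))

def solve(a, b):
    """Solve the linear system a z = b by Gaussian elimination with pivoting."""
    n = len(b)
    aug = [row[:] + [bi] for row, bi in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(col + 1, n):
            factor = aug[r][col] / aug[col][col]
            for c in range(col, n + 1):
                aug[r][c] -= factor * aug[col][c]
    z = [0.0] * n
    for r in range(n - 1, -1, -1):
        tail = sum(aug[r][c] * z[c] for c in range(r + 1, n))
        z[r] = (aug[r][n] - tail) / aug[r][r]
    return z

def fit_rkhs_perm(xs, ys, h, kernel=gaussian_kernel):
    """Coefficients a = (K + n h^2 I)^{-1} Y and the corresponding function."""
    n = len(xs)
    gram = [[kernel(xi, xj) for xj in xs] for xi in xs]
    system = [
        [gram[i][j] + (n * h * h if i == j else 0.0) for j in range(n)]
        for i in range(n)
    ]
    coeffs = solve(system, ys)
    f_hat = lambda x: sum(a * kernel(xi, x) for a, xi in zip(coeffs, xs))
    return f_hat, coeffs

# Toy data (an assumption): a sine curve sampled at 8 design points.
xs = [i / 7 for i in range(8)]
ys = [math.sin(2.0 * math.pi * x) for x in xs]
h = 0.1
f_hat, coeffs = fit_rkhs_perm(xs, ys, h)
print([round(f_hat(x), 3) for x in xs])
```

Since K a + n h^2 a = Y, the fitted value at each design point equals Y_i - n h^2 a_i, which gives a quick consistency check of the computed coefficients.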

In order to derive convergence rates for the estimator defined in (4.2) from
Theorem 2, we use some results about covering numbers of RKHS obtained
in [10] (other results on the entropy of RKHS can be found in [8, 38]). Let us
now assume that $P_X$ is a Borel measure. If K is a Mercer kernel (that is, a
continuous reproducing kernel), the RKHS associated with K is the set

$\mathcal H_K = \Big\{ f \in L^2(P_X) : f = \sum_{j=1}^\infty a_j \psi_j \text{ with } \sum_{j=1}^\infty \frac{a_j^2}{\lambda_j} < \infty \Big\}$,

where $(\lambda_j)_{j \ge 1}$ is the sequence of decreasing eigenvalues of the operator

$L_K : L^2(P_X) \to L^2(P_X)$, $f \mapsto \int K(\cdot, y) f(y) dP_X(y)$,

and $(\psi_j)_{j \ge 1}$ the sequence of corresponding eigenvectors. According to
Proposition 9 and Theorem D in [10], if for any $k \ge 1$ the k-th eigenvalue of $L_K$
is such that

Afc < Ck~^ (4.3) 

for some $C > 0$ and $l > 1/2$ then the entropy of $B_K(R) := \{f \in \mathcal H_K :
|f|_{\mathcal H_K} \le R\}$ satisfies for any $\delta > 0$:

$H_\infty(\delta, B_K(R)) \le \Big( \frac{2 R C_l}{\delta} \Big)^{1/l}$,

where C/ is slightly greater than QClK In this case. Theorem 2 and the 
arguments from Section 4.2 gives the following result. 

Corollary 1 (Adaptive upper bound for RKHS). Let $\hat f$ be defined
by (4.2) with a reproducing kernel K such that the eigenvalues of the operator
$L_K$ satisfy (4.3). Then, if $h = a n^{-l/(2l+1)}$ and $\|\hat f - f_0\|_\infty \le Q$, we have

$E^n \|\hat f - f_0\|^2_{L^2(P_X)} \le C_2 (1 + |f_0|^2_{\mathcal H_K}) n^{-2l/(2l+1)}$

when n is large enough.

Now, let $L = [l_{\min}, l_{\max}]$ where $l_{\min} > 1/2$ and let $(\mathcal H_l : l \in L)$ be a family of
nested RKHS. Assume that the kernel of each $\mathcal H_l$ satisfies (4.3). Let $\hat f_n$ be the
aggregated estimator defined by Steps 1-3 with $\Lambda_n = \{\lambda = (n^{-2l/(2l+1)}, \mathcal H_l) :
l \in L_n\}$ and $L_n := \{l_{\min}, l_{\min} + (\log n)^{-1}, \ldots, l_{\max}\}$. We have, if $f_0 \in \mathcal H_l$ for
some $l \in L$,

$E^n \|\hat f_n - f_0\|^2_{L^2(P_X)} \le C_2 (1 + |f_0|^2_{\mathcal H_l}) n^{-2l/(2l+1)}$

when n is large enough.

4.5. Anisotropic Besov spaces. In the nonparametric estimation literature, Besov spaces are of particular interest since they include functions with inhomogeneous smoothness, for instance functions with rapid oscillations or bumps. Roughly, these spaces are used in statistics when we want to prove theoretically that some adaptive estimator is able to recover the details of a function. When one considers a multivariate regression, the question of anisotropic smoothness naturally arises. Anisotropy means that the smoothness of $f_0$ differs from one coordinate direction to another. As far as we know, adaptive estimation of a multivariate curve with anisotropic smoothness was previously considered only in Gaussian white noise or density models, see [19], [24], [25], [37]. There are no results concerning the adaptive estimation of the regression with anisotropic smoothness on a general random design.

In this section, we construct, using Steps 1-3, an adaptive estimator over anisotropic Besov spaces $B^{\mathbf{s}}_{p,q}$, where $\mathbf{s} = (s_1, \ldots, s_d)$ is the vector of smoothnesses. If $\{e_1, \ldots, e_d\}$ is the canonical basis of $\mathbb{R}^d$, each $s_i$ is the smoothness in the direction $e_i$. A precise definition of $B^{\mathbf{s}}_{p,q}$ is given in Appendix A. Let $s$ be the harmonic mean of $\mathbf{s}$, see (2.3). Let us introduce two vectors $\mathbf{s}^{\min}$ and $\mathbf{s}^{\max}$ in $\mathbb{R}^d$ with positive coordinates and harmonic means $s^{\min}$ and $s^{\max}$ respectively. Assume that $\mathbf{s}^{\min} \le \mathbf{s}^{\max}$, which means that $s_i^{\min} \le s_i^{\max}$ for any $i \in \{1, \ldots, d\}$, and assume that $s^{\min} > d / \min(p, 2)$. In view of Theorem 5 and the embedding (A.1) (see Appendix A), we know that Assumption $(C_\beta)$ holds for every $B^{\mathbf{s}}_{p,\infty}$ such that $\mathbf{s} \ge \mathbf{s}^{\min}$ with $\beta = d/s$ (and every $B^{\mathbf{s}}_{p,q}$, since $B^{\mathbf{s}}_{p,q} \subset B^{\mathbf{s}}_{p,\infty}$), where $s$ is the harmonic mean of $\mathbf{s}$. Consider the "cube of smoothness"
$$S := \prod_{i=1}^d [s_i^{\min}, s_i^{\max}], \quad (4.4)$$

and consider the uniform discretization of this cube with step $(\log n)^{-1}$:

$$S_n := \prod_{i=1}^d \big\{ s_i^{\min} + k (\log n)^{-1} : 1 \le k \le [(s_i^{\max} - s_i^{\min}) \log n] \big\}, \quad (4.5)$$

and the set of parameters

$$\Lambda(S) := \big\{ \lambda = (n^{-s/(2s+d)}, B^{\mathbf{s}}_{p,q}) : \mathbf{s} \in S_n \big\}.$$

Now, we compute, following Steps 1-3, the aggregated estimator $\hat f_n$ with set of parameters $\Lambda(S)$ (see the beginning of the section). Following the arguments from Section 4.2, we can prove in the following Corollary 2 that $\hat f_n$ is adaptive over the whole range of anisotropic Besov spaces $\{B^{\mathbf{s}}_{p,q} : \mathbf{s} \in S\}$.

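Corollary 2. Assume that $\|\bar f - f_0\|_\infty \le Q$ for every $\bar f \in F(S)$. If $f_0 \in B^{\mathbf{s}}_{p,q}$ for some $\mathbf{s} \in S$, then

$$E^n \|\hat f_n - f_0\|^2_{L^2(P_X)} \le C n^{-2s/(2s+d)}$$

when $n$ is large enough, where $C$ is a constant depending on $S$, $d$ and $Q$.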

In Corollary 2 we recover the "expected" minimax rate $n^{-2s/(2s+d)}$ for the estimation of a $d$-dimensional curve in a Besov space. Note that there is no regular or sparse zone here, since the error of estimation is measured in the $L^2(P_X)$ norm. A minimax lower bound over $B^{\mathbf{s}}_{p,q}$ can be easily obtained using standard arguments, such as the ones from [40], together with Bernstein estimates over $B^{\mathbf{s}}_{p,q}$ that can be found in [18]. Note that the only assumption required on the design law in this corollary is the compactness of its support.
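To make the dependence of the rate on the smoothness vector concrete, here is a small sketch that computes the exponent $2s/(2s+d)$ and one coordinate of a discretization grid as in (4.5). It assumes the convention $s = d / \sum_i (1/s_i)$ for the harmonic mean (definition (2.3) is not reproduced in this section), and the function names are illustrative only.

```python
import math

def harmonic_mean(s):
    """Harmonic mean d / sum_i (1 / s_i) of a smoothness vector (assumed convention)."""
    return len(s) / sum(1.0 / si for si in s)

def rate_exponent(s):
    """Exponent 2 s_bar / (2 s_bar + d) of the rate n^(-2 s_bar / (2 s_bar + d))."""
    d = len(s)
    s_bar = harmonic_mean(s)
    return 2.0 * s_bar / (2.0 * s_bar + d)

def smoothness_grid_1d(s_min_i, s_max_i, n):
    """Points s_min_i + k / log(n) for 1 <= k <= floor((s_max_i - s_min_i) * log(n))."""
    step = 1.0 / math.log(n)
    k_max = math.floor((s_max_i - s_min_i) * math.log(n))
    return [s_min_i + k * step for k in range(1, k_max + 1)]

iso = rate_exponent([2.0, 2.0])      # isotropic smoothness 2 in dimension 2
aniso = rate_exponent([1.0, 4.0])    # one rough direction slows the rate down
grid = smoothness_grid_1d(1.0, 3.0, n=1000)
```

In the isotropic case the exponent reduces to the familiar $2s/(2s+d)$; with smoothness vector (1, 4) the harmonic mean is 1.6, so the rough direction dominates.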

5. Empirical study. In this section, we compare empirically our aggregation procedure with the popular cross-validation (CV) and generalized cross-validation (GCV) procedures for the selection of the smoothing parameter $h$ (see Section 2.5) in smoothing splines (we use the smooth.spline routine from the R software, see http://www.r-project.org/). Concerning CV, GCV and smoothing splines, we refer to [46] and [14]. Those routines provide satisfactory results in most cases, in particular for the examples of regression functions considered here. However, we show that when the sample size $n$ is small (less than 50), and when the noise level is high (we take a root-signal-to-noise ratio equal to 2), then our aggregation approach is more stable, see Figure 4 below. Herein, we consider two examples of regression functions, given, for $x \in [-1, 1]$, by:
• hardsine$(x) = 2 \sin(1 + x) \sin(2\pi x^2 + 1)$

• oscsine$(x) = (x + 1) \sin(4\pi x^2)$.

We simply take $X$ uniformly distributed on $[-1, 1]$ and Gaussian noise with variance $\sigma^2$ chosen so that the root-signal-to-noise ratio is 2. In Figure 3 we show typical simulations in this setting, where $n = 30$.
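The simulation setting can be sketched as follows. This is an illustration under stated assumptions rather than the authors' script: the exponent 2 inside the sines is read from the garbled formulas, and the root-signal-to-noise ratio is taken to be the standard deviation of $f_0(X)$ divided by $\sigma$, a definition the text does not spell out.

```python
import numpy as np

def hardsine(x):
    return 2.0 * np.sin(1.0 + x) * np.sin(2.0 * np.pi * x**2 + 1.0)

def oscsine(x):
    return (x + 1.0) * np.sin(4.0 * np.pi * x**2)

def noise_std(f0, rsnr=2.0, grid_size=10_001):
    """Noise level sigma such that std(f0) / sigma = rsnr (assumed definition of rsnr)."""
    grid = np.linspace(-1.0, 1.0, grid_size)
    return np.std(f0(grid)) / rsnr

def simulate(f0, n=30, rsnr=2.0, seed=0):
    """Draw X uniform on [-1, 1] and Y = f0(X) + sigma * Gaussian noise."""
    rng = np.random.default_rng(seed)
    sigma = noise_std(f0, rsnr)
    X = rng.uniform(-1.0, 1.0, size=n)
    Y = f0(X) + sigma * rng.standard_normal(n)
    return X, Y, sigma

X, Y, sigma = simulate(hardsine, n=30)
```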




Fig 3. Examples of simulated data, for $f_0$ = hardsine (left) and $f_0$ = oscsine (right)

In Figure 4, we show the MISEs $E\|\hat f_n - f_0\|^2$ computed by Monte Carlo using 1000 simulations of the model. The tuning of the estimators in both examples is the following: for GCV, we simply use the smooth.spline routine with default selection of $h$ by GCV. For CV, we use the same routine, with the option cv=TRUE so that CV is used instead. For aggregation, we use Steps 1-3 (see Section 4). Step 1 is done with $m = 3n/4$ and $\ell = n/4$. For Step 2, we use the smooth.spline routine to compute a set of weak estimators, using the option spar=x, where x lies in the set $\{0, 0.01, 0.02, \ldots, 1\}$. The parameter spar is related to the value of the smoothing parameter $h$. For Step 3, we compute the weights with temperature given by (3.3) (over the training sample) and the set $T = \{10, 20, \ldots, 100\}$. Then, we repeat Steps 1-3 $J = 100$ times and compute the jackknifed estimator, see Section 4.1. This gives our aggregated estimator.
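The split, fit, weight and average steps described above can be sketched as follows. This is only an illustration under several assumptions: a Gaussian-kernel ridge fit stands in for R's smooth.spline, a single fixed temperature replaces the choice (3.3), and the weights are taken proportional to $\exp(-\ell R_\ell(f)/T)$ on the held-out part; all function names are hypothetical.

```python
import numpy as np

def ridge_fit(X, Y, h, bw=0.3):
    """Weak estimator: kernel ridge fit (a stand-in for a smoothing spline)."""
    K = np.exp(-(X[:, None] - X[None, :])**2 / (2 * bw**2))
    a = np.linalg.solve(K + len(X) * h**2 * np.eye(len(X)), Y)
    return lambda x: np.exp(-(x[:, None] - X[None, :])**2 / (2 * bw**2)) @ a

def gibbs_weights(risks, n_learn, T):
    """Weights proportional to exp(-n_learn * risk / T), shifted for stability."""
    w = np.exp(-n_learn * (risks - risks.min()) / T)
    return w / w.sum()

def aggregate_once(X, Y, hs, T, rng):
    """One random split: fit on 3n/4 points, weight on the remaining n/4."""
    n = len(X)
    perm = rng.permutation(n)
    m = (3 * n) // 4
    train, learn = perm[:m], perm[m:]
    fits = [ridge_fit(X[train], Y[train], h) for h in hs]
    risks = np.array([np.mean((Y[learn] - f(X[learn]))**2) for f in fits])
    theta = gibbs_weights(risks, len(learn), T)
    return lambda x: sum(t * f(x) for t, f in zip(theta, fits))

def jackknifed_aggregate(X, Y, hs, T=1.0, J=20, seed=0):
    """Average the aggregates obtained over J random splits."""
    rng = np.random.default_rng(seed)
    aggs = [aggregate_once(X, Y, hs, T, rng) for _ in range(J)]
    return lambda x: np.mean([g(x) for g in aggs], axis=0)

rng = np.random.default_rng(1)
X = rng.uniform(-1.0, 1.0, size=30)
Y = np.sin(3.0 * X) + 0.3 * rng.standard_normal(30)
estimator = jackknifed_aggregate(X, Y, hs=[0.01, 0.05, 0.2, 1.0], T=0.5, J=5)
values = estimator(np.linspace(-1.0, 1.0, 50))
```

Averaging over several splits reduces the dependence of the final estimator on any single random partition of the sample.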

In Figure 4, we plot the MISEs (the mean of the 1000 integrated squared errors obtained over the simulations) for sample sizes $n \in \{20, 30, 50, 100\}$ and in Figure 5 we plot the corresponding standard deviations. The conclusion is that for small $n$, aggregation provides a more accurate and stable estimation than GCV or CV. When $n$ is 100 or larger, the aggregation procedure has roughly the same accuracy as GCV or CV.
Fig 5. Standard deviation of the MISE for $f_0$ = hardsine (left) and $f_0$ = oscsine (right)
6. Proofs of the main results. We recall that $P_n$ stands for the joint law of the training sample $D_n$ conditional on $X^n := (X_1, \ldots, X_n)$, that is $P_n := P^n[\,\cdot \mid X^n]$.

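Proof of Theorem 1. First, we use the peeling argument: we decompose $B_n(f_0, \delta)$ into the union of the sets $S_j$ for $j \ge 0$, where for $\delta_j := \delta 2^{-j/\beta}$,

$$S_j := B_n(f_0, \delta_j) \setminus B_n(f_0, \delta_{j+1}),$$

and decompose $\mathcal{F}$ into the union of the sets

$$B_{\mathcal{F}}(2^{k/\beta}) \setminus B_{\mathcal{F}}(2^{(k-1)/\beta})$$

for $k \ge 1$, where $B_{\mathcal{F}}(2^{k/\beta}) = \{f \in \mathcal{F} : |f|_{\mathcal{F}} \le 2^{k/\beta}\}$. This gives that the left hand side of (2.6) is smaller than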

> Pn sup > Z 

\f\T<l 

Z{f - /o) 



+ 



jl^ofe' "^f 65,05^2'=/^) 11/ - /o||n"^/'(l + 1/1^)^/2 



> Z 



which is smaller than 



sup Z{f - fo) > z{5,j,k) 

/ei?„(/o,5j)ns^(2fe//5) 



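where $z(\delta, j, k) := z \delta_{j+1}^{1-\beta/2} 2^{(k-1)/2}$. Let us consider, for any $\delta > 0$, a minimal $\delta$-covering $F(\delta, k)$ of the set $B_{\mathcal{F}}(2^{k/\beta})$ for the $\|\cdot\|_\infty$-norm. Assumption $(C_\beta)$ implies

$$|F(\delta, k)| \le \exp\big( D (2^{k/\beta}/\delta)^\beta \big) = \exp(D 2^k \delta^{-\beta}).$$

Moreover, without loss of generality, we can assume that $F(\delta, k) \subset B_{\mathcal{F}}(2^{k/\beta})$.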
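For any $i \in \mathbb{N}$ and $j, k$ fixed, we introduce

$$F^{(i)} := F(\delta_{i,j}, k) \quad \text{where } \delta_{i,j} := \delta_j 2^{-i/\beta} = \delta 2^{-(i+j)/\beta}, \quad (6.1)$$

and, for any $f \in B_{\mathcal{F}}(2^{k/\beta})$ we denote by $\pi_i(f)$ an element of $F^{(i)}$ such that $\|\pi_i(f) - f\|_\infty \le \delta_{i,j}$. We have

$$P_{j,k} := P_n\Big[ \sup_{f \in B_n(f_0, \delta_j) \cap B_{\mathcal{F}}(2^{k/\beta})} Z(f - f_0) > z(\delta, j, k) \Big] \le P_{j,k,1} + P_{j,k,2},$$

where

$$P_{j,k,1} := P_n\Big[ \sup_{f \in B_n(f_0, \delta_j) \cap B_{\mathcal{F}}(2^{k/\beta})} |Z(\pi_0(f) - f_0)| > z(\delta, j, k)/2 \Big],$$

$$P_{j,k,2} := P_n\Big[ \sup_{f \in B_n(f_0, \delta_j) \cap B_{\mathcal{F}}(2^{k/\beta})} |Z(f - \pi_0(f))| > z(\delta, j, k)/2 \Big].$$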
First, we consider $P_{j,k,1}$:

Pj,k,i < Pn\ sup iZiMf) - /o)l > 46, j, k)/2 
feFmnB„{fo,5j) 

We use (2.5) and the union bound over $F^{(0)}$ together with the fact that $f \in B_n(f_0, \delta_j)$ to obtain:

where $a := (2 b_\varepsilon^2)^{-1}$. Now, in order to control $P_{j,k,2}$, we use the so-called chaining argument, which involves increasing approximations by the covers $F^{(i)}$, see (6.1). Let us consider

$$E_i := (2^{1/\beta - 1/2} - 1) 2^{-i(1/\beta - 1/2)}$$

for $i \ge 1$ ($E_i > 0$ since $\beta \in (0, 2)$). By linearity of $Z(\cdot)$ and since $\sum_{i \ge 1} E_i = 1$, we have

Pj,k,2 < E^4 - > Eiz{6,j,k)/2 

i>i ^/eB„{/o,5,) 

|/|^<2'=/'3 

i>l 

Now, since

$$\|\pi_i(f) - \pi_{i-1}(f)\|_n \le \|\pi_i(f) - \pi_{i-1}(f)\|_\infty \le \|\pi_i(f) - f\|_\infty + \|\pi_{i-1}(f) - f\|_\infty$$

and since the number of pairs $\{\pi_i(f), \pi_{i-1}(f)\}$ is at most



X |f(*~^)| < exp 
we obtain using again (2.5): 



25/3 



F < |F«| X X exp fZ^Mf!MM^ 



;,(i + 2i//3)^ 

exp(^^(3F'/2-Ciz2- 
where Ci = Ci{s,d,a) := a{2^l^-^l'^ - 1)/(8(1 + 2^1 f^f) > 0. Then, if we
choose zi := we have for any z > zi and Di := Ci/2: 

j,k>0 j,k>0 i>l 

< J2 {eM-Di2^^''zH~'')+J2^M-Di2'+^+'z^6-'')) 

j,k>0 i>l 

and the theorem follows. □

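Proof of Theorem 2. For short, we shall write $\bar f$ instead of $\bar f_\lambda$, and pen$(f)$ instead of pen$_\lambda(f)$. In view of (2.1), we have

$$\|Y - \bar f\|_n^2 + \mathrm{pen}(\bar f) \le \|Y - f\|_n^2 + \mathrm{pen}(f) \quad \forall f \in \mathcal{F}, \quad (6.2)$$

which is equivalent to

$$\|\bar f - f\|_n^2 + \mathrm{pen}(\bar f) \le 2 \langle Y - f, \bar f - f \rangle_n + \mathrm{pen}(f) \quad \forall f \in \mathcal{F},$$

where $\langle f, g \rangle_n = n^{-1} \sum_{i=1}^n f(X_i) g(X_i)$. This entails, since $f_0 \in \mathcal{F}$, that

$$\|\bar f - f_0\|_n^2 + \mathrm{pen}(\bar f) \le 2 n^{-1/2} Z(\bar f - f_0) + \mathrm{pen}(f_0) \quad (6.3)$$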

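where $Z(\cdot)$ is the empirical process given by (2.4). Recall that $B_n(f_0, \delta)$ stands for the ball centered at $f_0$ with radius $\delta$ for the norm $\|\cdot\|_n$. Let us introduce the event

$$\mathcal{Z}(z, \delta) := \Big\{ \sup_{f \in \mathcal{F} \cap B_n(f_0, \delta)} \frac{Z(f - f_0)}{\|f - f_0\|_n^{1-\beta/2} (1 + |f|_{\mathcal{F}})^{\beta/2}} \le z \Big\}. \quad (6.4)$$

In view of Theorem 1, see Section 2.3, we can find constants $z_1 > 0$ and $D_1 > 0$ such that:

$$P_n[\mathcal{Z}(z, \delta)^c] \le \exp(-D_1 z^2 \delta^{-\beta}),$$

for any $\delta > 0$ and $z \ge z_1$. When $2 n^{-1/2} Z(\bar f - f_0) \le \mathrm{pen}(f_0)$, we have $\|\bar f - f_0\|_n^2 \le 2\,\mathrm{pen}(f_0)$. When $2 n^{-1/2} Z(\bar f - f_0) > \mathrm{pen}(f_0)$, we have, for any $z > 0$, in view of (6.3), whenever $\bar f \in B_n(f_0, \delta)$ for some $\delta > 0$, that on $\mathcal{Z}(z, \delta)$,

$$\|\bar f - f_0\|_n^2 + \mathrm{pen}(\bar f) \le \frac{4z}{\sqrt{n}} \|\bar f - f_0\|_n^{1-\beta/2} (1 + |\bar f|_{\mathcal{F}})^{\beta/2}.$$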
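If $|\bar f|_{\mathcal{F}} \le 1$, this entails

$$\|\bar f - f_0\|_n^2 + \mathrm{pen}(\bar f) \le \big( a^{-2} (2^{\beta/2} 4z)^{4/(2+\beta)} + 1 \big) h^2.$$

Otherwise, we have

$$\|\bar f - f_0\|_n^2 + \mathrm{pen}(\bar f) \le \frac{2^{\beta/2} 4z}{\sqrt{n}} \|\bar f - f_0\|_n^{1-\beta/2} |\bar f|_{\mathcal{F}}^{\beta/2}$$

and we use the following lemma.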

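Lemma 1. Let $r, I, h, \varepsilon$ be positive numbers, $\beta \in (0, 2)$ and $\alpha > 2\beta/(\beta + 2)$. Then, if

$$r^2 + h^2 I^\alpha \le \varepsilon \, r^{1-\beta/2} I^{\beta/2}, \quad (6.5)$$

we have

$$r \le (\varepsilon^\alpha h^{-\beta})^{2/(2\alpha + \alpha\beta - 2\beta)}, \quad I \le (\varepsilon^2 h^{-(\beta+2)})^{2/(2\alpha + \alpha\beta - 2\beta)}$$

and consequently

$$r^2 + h^2 I^\alpha \le 2 (\varepsilon^\alpha h^{-\beta})^{4/(2\alpha + \alpha\beta - 2\beta)}.$$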
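As a sanity check on the exponents in Lemma 1, the following sketch draws random parameters satisfying (6.5) and verifies the three conclusions numerically. It is a randomized test under assumed sampling ranges, not a proof; all computations are done on the log scale to avoid overflow, and the names are illustrative.

```python
import math
import numpy as np

def lemma1_log_bounds(log_eps, log_h, alpha, beta):
    """Logarithms of the three upper bounds stated in Lemma 1."""
    denom = 2 * alpha + alpha * beta - 2 * beta   # positive iff alpha > 2*beta/(beta+2)
    log_r_max = (2.0 / denom) * (alpha * log_eps - beta * log_h)
    log_I_max = (2.0 / denom) * (2.0 * log_eps - (beta + 2.0) * log_h)
    log_total_max = math.log(2.0) + 2.0 * log_r_max
    return log_r_max, log_I_max, log_total_max

rng = np.random.default_rng(0)
violations = 0
for _ in range(10_000):
    beta = rng.uniform(0.05, 1.95)
    alpha = 2 * beta / (beta + 2) + rng.uniform(0.05, 2.0)
    log_r, log_I, log_h = rng.uniform(-3.0, 3.0, size=3)
    # log of the left-hand side r^2 + h^2 I^alpha of (6.5)
    log_lhs = np.logaddexp(2 * log_r, 2 * log_h + alpha * log_I)
    # smallest eps for which (6.5) holds, then inflated by a random factor >= 1
    log_eps_min = log_lhs - (1 - beta / 2) * log_r - (beta / 2) * log_I
    log_eps = log_eps_min + rng.uniform(0.0, 1.0)
    b_r, b_I, b_tot = lemma1_log_bounds(log_eps, log_h, alpha, beta)
    tol = 1e-9
    if log_r > b_r + tol or log_I > b_I + tol or log_lhs > b_tot + tol:
        violations += 1
```

The counter `violations` records how many sampled configurations break one of the three bounds.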

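The proof of this Lemma is given in Section 7 below. It entails, since $h = a n^{-1/(2+\beta)}$ and $\alpha > 2\beta/(\beta + 2)$, that

$$\|\bar f - f_0\|_n^2 + h^2 |\bar f|_{\mathcal{F}}^\alpha \le 2 \big( (2^{\beta/2} 4z)^\alpha a^{-\beta} \big)^{4/(2\alpha + \alpha\beta - 2\beta)} n^{-2/(\beta+2)}.$$

Thus, when $\bar f \in B_n(f_0, \delta)$, we have on $\mathcal{Z}(z, \delta)$:

$$\|\bar f - f_0\|_n^2 + \mathrm{pen}(\bar f) \le p(z)^2 h^2$$

where

$$p(z)^2 := C_1 \big( 1 + |f_0|_{\mathcal{F}}^\alpha + z^{4\alpha/(2\alpha + \alpha\beta - 2\beta)} \big)$$

and $C_1$ is a constant depending on $\alpha$, $\beta$ and $a$. Let us assume for now that $\|\bar f - f_0\|_n \le \delta$ for some $\delta > 0$, and let us introduce

$$\mathcal{Z}_1(z, \delta) := \mathcal{Z}(z, \delta) \cap \mathcal{Z}(z_1, p(z) h),$$

where $z_1$ is a constant coming from Theorem 1. On $\mathcal{Z}_1(z, \delta)$, we have

$$\|\bar f - f_0\|_n^2 + \mathrm{pen}(\bar f) \le p(z_1)^2 h^2. \quad (6.6)$$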
Indeed, we have $\bar f \in B_n(f_0, \delta)$; thus, on $\mathcal{Z}(z, \delta)$, $\|\bar f - f_0\|_n^2 + \mathrm{pen}(\bar f) \le p(z)^2 h^2$ and so $\|\bar f - f_0\|_n \le p(z) h$. Thus, on the event $\mathcal{Z}(z_1, p(z) h)$, we have (6.6). Moreover, Theorem 1 yields

$$P_n[\mathcal{Z}_1(z, \delta)^c] \le \exp(-D_1 z^2 \delta^{-\beta}) + \exp(-D_1 z_1^2 (p(z) h)^{-\beta}). \quad (6.7)$$

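Now, in view of (6.2) and since $f_0 \in \mathcal{F}$, we have the following rough upper bound:

$$\|\bar f - f_0\|_n^2 + \mathrm{pen}(\bar f) \le 2(\|\bar f - Y\|_n^2 + \mathrm{pen}(\bar f)) + 2\|f_0 - Y\|_n^2$$
$$\le 2(\|f_0 - Y\|_n^2 + \mathrm{pen}(f_0)) + 2\|f_0 - Y\|_n^2$$
$$\le 4\sigma^2 \|\varepsilon\|_n^2 + 2\,\mathrm{pen}(f_0), \quad (6.8)$$

which entails

$$E_n[(\|\bar f - f_0\|_n^2 + \mathrm{pen}(\bar f))^2] \le \sigma^4 C(\varepsilon)^2 + 8 h^4 |f_0|_{\mathcal{F}}^{2\alpha}$$

where $C(\varepsilon)^2 = 32(E[\varepsilon^4]/n + 2(E[\varepsilon^2])^2)$. Putting all this together, we obtain, by a decomposition of $E_n[\|\bar f - f_0\|_n^2 + \mathrm{pen}(\bar f)]$ over the union of the sets $\{\|\bar f - f_0\|_n \le \delta\} \cap \mathcal{Z}_1(z, \delta)$, $\mathcal{Z}_1(z, \delta)^c$ and $\{\|\bar f - f_0\|_n > \delta\}$ that

$$E_n[\|\bar f - f_0\|_n^2 + \mathrm{pen}(\bar f)] \le p(z_1)^2 h^2 + (\sigma^2 C(\varepsilon) + 2\sqrt{2} h^2 |f_0|_{\mathcal{F}}^\alpha)\big( P_n[\mathcal{Z}_1(z, \delta)^c]^{1/2} + P_n[\|\bar f - f_0\|_n > \delta]^{1/2} \big).$$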

In view of (6.8), if $\delta \ge 2\,\mathrm{pen}(f_0) \vee 1$ then we have $\{\|\bar f - f_0\|_n^2 > \delta^2\} \subset \{\|\varepsilon\|_n^2 > (\delta^2 - \delta)/(4\sigma^2)\}$. Thus, using the subgaussianity assumption (1.3), we have $P[\|\bar f - f_0\|_n > \delta]^{1/2} \le \exp(-(\delta^2 - \delta)^2/(8\sigma^4)) \le \exp(-C_2 (\log n)^4) = o(h^2)$ if one chooses $\delta = \log n$. Now, using (6.7) with this choice of $\delta$ and $z = (\log n)^{1+\beta/2}$ we have also $P_n[\mathcal{Z}_1(z, \delta)^c] \le \exp(-C_3 (\log n)^2) = o(h^2)$. This concludes the proof of the first upper bound of Theorem 2.

To prove the upper bound for the integrated norm $\|\cdot\|$ instead of the empirical norm $\|\cdot\|_n$, we decompose $\|\bar f - f_0\|^2 = A_1 + A_2$ where

$$A_1 := \|\bar f - f_0\|^2 - 8(\|\bar f - f_0\|_n^2 + \mathrm{pen}(\bar f)) \quad \text{and} \quad A_2 := 8(\|\bar f - f_0\|_n^2 + \mathrm{pen}(\bar f)).$$

The first part of Theorem 2 provides

$$E^n[A_2] \le C_1 (1 + |f_0|_{\mathcal{F}}^\alpha) n^{-2/(2+\beta)}.$$

Recall that we assumed that $\|\bar f - f_0\|_\infty \le Q$ a.s. for the second part of the Theorem. To handle $A_1$, we use the following Lemma.
Lemma 2. Let $(\mathcal{F}, |\cdot|_{\mathcal{F}})$ and $h$ satisfy the same assumptions as in Theorem 2. Define $\mathcal{F}_Q := \{f \in \mathcal{F} : \|f - f_0\|_\infty \le Q\}$. We can find constants $z_0, D_0 > 0$ such that for any $z \ge z_0$:

$$P^n\big[ \exists f \in \mathcal{F}_Q : \|f - f_0\|^2 - 8(\|f - f_0\|_n^2 + \mathrm{pen}(f)) > 10 z h^2 \big] \le \exp(-D_0 n h^2 z),$$

where $z_0$ and $D_0$ are constants depending on $a$, $\alpha$, $\beta$ and $Q$.

The proof of Lemma 2 is given in Section 7. Using together Lemma 2 and the fact that $A_1 \le Q^2$ a.s., we have by a decomposition over the union of $\{A_1 > 10 z_0 h^2\}$ and $\{A_1 \le 10 z_0 h^2\}$:

$$E^n[A_1] \le 10 z_0 h^2 + o(h^2).$$

This concludes the proof of Theorem 2. □




Fig 6. Example of a setup in which ERM performs badly. The set $F(\Lambda) = \{f_1, \ldots, f_M\}$ is the dictionary from which we want to mimic the best element and $f_0$ is the regression function.



Proof of Theorem 3. We consider a random variable $X$ uniformly distributed on $[0, 1]$ and its dyadic representation:

$$X = \sum_{k=1}^{+\infty} X^{(k)} 2^{-k}, \quad (6.9)$$

where $(X^{(k)} : k \ge 1)$ is a sequence of i.i.d. random variables following a Bernoulli distribution $\mathcal{B}(1/2, 1)$ with parameter 1/2. The random variable $X$ is the design of the regression model worked out here. For the regression function we take

, , , f 2/i if = 1 , , 

where $x$ has the dyadic decomposition $x = \sum_{k \ge 1} x^{(k)} 2^{-k}$ where $x^{(k)} \in \{0, 1\}$ and

$$h = \frac{C}{4} \sqrt{\frac{\log M}{n}}.$$

We consider the dictionary of functions $F_M = \{f_1, \ldots, f_M\}$

$$f_j(x) = 2 x^{(j)} - 1, \quad \forall j \in \{1, \ldots, M\}, \quad (6.11)$$

where again $(x^{(j)} : j \ge 1)$ is the dyadic decomposition of $x \in [0, 1]$. The dictionary $F_M$ is chosen so that we have, for any $j \in \{1, \ldots, M-1\}$
Wfj - /olli2([o,i]) = ^ + ^ ^^^^ ll-^A^ - /o|li2([o,i]) = ^-h + l. 
Thus, we have 

j=™%/ ~ •^olli2([o,i]) = ll-^Af - /o|li2([o,i]) = ^ + 1- 
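The dictionary (6.11) is easy to simulate. The sketch below (function names are illustrative) extracts the dyadic digits $x^{(j)}$ and checks empirically that the functions $f_j(x) = 2x^{(j)} - 1$ behave like independent Rademacher variables under the uniform design, so that their Gram matrix in $L^2([0,1])$ is close to the identity.

```python
import numpy as np

def dyadic_digit(x, j):
    """j-th binary digit x^(j) of x in [0, 1), so that x = sum_j x^(j) 2^(-j)."""
    return np.floor(x * 2.0**j).astype(np.int64) % 2

def dictionary_function(x, j):
    """f_j(x) = 2 x^(j) - 1, taking values in {-1, +1}."""
    return 2 * dyadic_digit(x, j) - 1

rng = np.random.default_rng(0)
X = rng.uniform(0.0, 1.0, size=200_000)
M = 5
F = np.stack([dictionary_function(X, j) for j in range(1, M + 1)])
gram = F @ F.T / X.size   # empirical Gram matrix of (f_1, ..., f_M)
```

For instance, 0.625 is 0.101 in binary, so its first three digits are 1, 0, 1.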

This geometrical setup for $F(\Lambda)$, which is an unfavourable setup for the ERM, is represented in Figure 6. For

fn := f™"^ G argmin + pen(/)), 

where we take $R_n(f) = n^{-1} \sum_{i=1}^n (Y_i - f(X_i))^2 = \|Y - f\|_n^2$, we have

EUn - /o|li2([o,i]) = ^.jnin^^ II/, - /o|li2([o,i]) + hP[l / /m]. (6.12) 

Now, we upper bound $P[\tilde f_n = f_M]$. If we define

1 " 



n 1- , 



we have by the definition of $h$ and since $\zeta_i^{(j)} \in \{-1, 1\}$:



/:F-/Afii^-r-/,ii^) 



- nm + E(cP^cr) + 3(cF) - ci^^) - 1) 

4C 



> Nj-Nm VlogM. 
This entails, for $\bar N_{M-1} := \max_{1 \le j \le M-1} N_j$, that



M-l 



P[fn = fM]=P[ n {\\Y-fM\ 



< P 



[Nm > Nm-1 



6C 



a 



\Y - fjWl < pen(/j) - pen(/A./)| 



It is easy to check that $N_1, \ldots, N_M$ are $M$ normalized standard Gaussian random variables, uncorrelated (but dependent). We denote by $\zeta$ the family of Rademacher variables $(\zeta_i^{(j)} : i = 1, \ldots, n; j = 1, \ldots, M)$. We have for any $6C/\sigma < \gamma < (2\sqrt{2} c^*)^{-1}$ ($c^*$ is the "Sudakov constant", see Theorem 7),

6C 



P[fn = Jm] < E 



p(Nm > Nm-1 - — TbgM C 
V a 

< P[Nm > -jV^ogM + E{Nm-i\C)] 

6C 



(6.13) 



E 



p{e{Nm-i\0 - Nm-1 > (7 - —)Vh^\c} 



Conditionally to $\zeta$, the vector $(N_1, \ldots, N_{M-1})$ is a linear transform of the Gaussian vector $(\varepsilon_1, \ldots, \varepsilon_n)$. Hence, conditionally to $\zeta$, $(N_1, \ldots, N_{M-1})$ is a Gaussian vector. Thus, we can use a standard deviation result for the supremum of Gaussian random vectors (see for instance [36], Chapter 3.2.4), which leads to the following inequality for the second term of the RHS in (6.13):

$$P\Big\{ E(\bar N_{M-1} \mid \zeta) - \bar N_{M-1} > \Big(\gamma - \frac{6C}{\sigma}\Big) \sqrt{\log M} \,\Big|\, \zeta \Big\} \le \exp\big( -(3C/\sigma - \gamma/2)^2 \log M \big).$$

Remark that we used $E[N_j^2 \mid \zeta] = 1$ for any $j = 1, \ldots, M-1$. For the first term in the RHS of (6.13), we have



Nm > --fV^ogM + E{Nm-i\0 
< P 



+ P 



Nm > -27v/log M + E{Nm-i) 
-^^/\^ + E{Nm-i) > E{Nm-i\C) 



(6.14) 



Next, we use Sudakov's Theorem (cf. Theorem 7 in Appendix B) to lower bound $E(\bar N_{M-1})$. Since $(N_1, \ldots, N_{M-1})$ is, conditionally to $\zeta$, a Gaussian vector and since for any $1 \le j \ne k \le M$ we have



1 



n 
then, according to Sudakov's minoration (cf. Theorem 7 in the Appendix), there exists an absolute constant $c^* > 0$ such that



1 " 

c*E[Nm-i\C]> min f- V (C 



(k) 



-(i)\2 



1/2 



v/logM. 



Thus, we have 

c*E[JV„_,| > E 



1/2- 



mm 



> V2 1 - S 



■^^ 1=1 



where we used the fact that $\sqrt{x} \ge x/\sqrt{2}$, $\forall x \in [0, 2]$. Besides, using Hoeffding's inequality we have $E[\exp(s \xi^{(j,k)})] \le \exp(s^2/(2n))$ for any $s > 0$, where $\xi^{(j,k)} := n^{-1} \sum_{i=1}^n \zeta_i^{(j)} \zeta_i^{(k)}$. Then, using a maximal inequality (cf. Theorem 8 in Appendix B) and since $n^{-1} \log[(M-1)(M-2)] \le 1/4$, we have



E 



I E cf ^CFI < log[(M - 1)(M - 2) 



1/2 1 
< . 

- 2 



(6.15) 



This entails 



c*E[Nm-i] > 



log M\ 1/2 



Thus, using this inequality in the first term of the RHS of (6.14) and the usual inequality on the tail of a Gaussian random variable ($N_M$ is standard Gaussian), we obtain:



P 



Nm > - 27VlogM + ^(A^, 



M~l, 



< P 



< 



Nm > iic*V2)-^ - 27) V^^Pf' 

(6.16) 



< exp ( - ((c*^/2)-i - 27)2(log M)/2) . 



Remark that we used $2\sqrt{2} c^* \gamma < 1$. For the second term in (6.14), we apply the concentration inequality of Theorem 6 to the non-negative random variable $E[\bar N_{M-1} \mid \zeta]$. We first have to control the second moment of this variable. We know that, conditionally to $\zeta$, $N_j \mid \zeta \sim \mathcal{N}(0, 1)$; thus $N_j \mid \zeta \in L_{\psi_2}$ (for more details on Orlicz norms, we refer the reader to [45]). Thus,

$$\Big\| \max_{1 \le j \le M-1} N_j \mid \zeta \Big\|_{\psi_2} \le K \psi_2^{-1}(M) \max_{1 \le j \le M-1} \| N_j \mid \zeta \|_{\psi_2}$$

(cf. Lemma 2.2.2 in [45]). Since $\|N_j \mid \zeta\|_{\psi_2} = 1$, we have $\|\max_{1 \le j \le M-1} N_j \mid \zeta\|_{\psi_2} \le K \sqrt{\log M}$. In particular, we have $E[\max_{1 \le j \le M-1} N_j^2 \mid \zeta] \le K \log M$ and so $E(E[\bar N_{M-1} \mid \zeta])^2 \le K \log M$. Theorem 6 provides

$$P\big[ -\gamma \sqrt{\log M} + E[\bar N_{M-1}] > E[\bar N_{M-1} \mid \zeta] \big] \le \exp(-\gamma^2/c_0), \quad (6.17)$$

where $c_0$ is an absolute constant.

Finally, combining (6.13), (6.16), (6.14), (6.17) in the initial inequality 
(6.13), we obtain 

P[fn = fu] < exp(-(3C7/a - 7)'logM) 

+ exp ( - {{c*V2)-^ - 27)2(logM)/2) + exp(-7Vco). 



Take $\gamma = (12\sqrt{2} c^*)^{-1}$. It is easy to find an integer $M_0(\sigma)$ depending only on $\sigma$ such that for any $M \ge M_0$, we have $P[\tilde f_n = f_M] \le c_1 < 1$, where $c_1$ is an absolute constant. We complete the proof by using this last result in (6.12). □

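Proof of Theorem 4. We recall that we have a dictionary (set of functions) $F(\Lambda)$ of cardinality $M$ such that $\|f_\lambda - f_0\|_\infty \le Q$ for all $\lambda \in \Lambda$. Let us define the risk

$$R(f) := E[(Y - f(X))^2]$$

and the linearized risk over $F(\Lambda)$, given by

$$\mathrm{R}(\theta) := \sum_{\lambda \in \Lambda} \theta_\lambda R(f_\lambda)$$

for $\theta \in \Theta$, where we recall that

$$\Theta := \Big\{ \theta \in \mathbb{R}^{|\Lambda|} ; \theta_\lambda \ge 0, \sum_{\lambda \in \Lambda} \theta_\lambda = 1 \Big\}.$$

We denote by $R_n(f)$ the empirical risk of $f$ over the sample $D_n$, which is given by

$$R_n(f) := \frac{1}{n} \sum_{i=1}^n (Y_i - f(X_i))^2,$$

and we define similarly the linearized empirical risk

$$\mathrm{R}_n(\theta) := \sum_{\lambda \in \Lambda} \theta_\lambda R_n(f_\lambda).$$

The excess risk of a function $f$ is given by $R(f) - R(f_0) = \|f - f_0\|^2$. By convexity of the risk, the aggregate $\hat f = \sum_{\lambda \in \Lambda} \hat\theta_\lambda f_\lambda$ defined in (3.1) satisfies, for any $a > 0$,

$$R(\hat f) - R(f_0) \le \mathrm{R}(\hat\theta) - R(f_0)$$
$$\le (1 + a)(\mathrm{R}_n(\hat\theta) - R_n(f_0)) + \mathrm{R}(\hat\theta) - R(f_0) - (1 + a)(\mathrm{R}_n(\hat\theta) - R_n(f_0)),$$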

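where it is easy to see that the Gibbs weights $\hat\theta = (\hat\theta_\lambda)_{\lambda \in \Lambda} = (\theta(f_\lambda))_{\lambda \in \Lambda}$ are the unique solution to the minimization problem

$$\min_{\theta \in \Theta} \Big\{ \mathrm{R}_n(\theta) + \frac{T}{n} \sum_{\lambda \in \Lambda} \theta_\lambda \log \theta_\lambda \Big\},$$

where $T$ is the temperature parameter, see (3.2), and where we use the convention $0 \log 0 = 0$. Let $\hat\lambda$ be such that $f_{\hat\lambda}$ is the ERM in $F(\Lambda)$, namely

$$R_n(f_{\hat\lambda}) := \min_{\lambda \in \Lambda} R_n(f_\lambda).$$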
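The variational characterization of the Gibbs weights can be checked numerically. In the sketch below, which assumes the explicit form $\hat\theta_\lambda \propto \exp(-n R_n(f_\lambda)/T)$ implied by the minimization problem above, the empirical risks are arbitrary made-up numbers and the function names are illustrative. The weights are compared against random points of the simplex, and the minimal value is compared with its closed form $-(T/n) \log \sum_\lambda \exp(-n R_n(f_\lambda)/T)$.

```python
import numpy as np

def gibbs_weights(risks, n, T):
    """Weights proportional to exp(-n * R_n(f_lambda) / T), shifted for stability."""
    w = np.exp(-n * (risks - risks.min()) / T)
    return w / w.sum()

def penalized_linear_risk(theta, risks, n, T):
    """R_n(theta) + (T/n) * sum_lambda theta log theta, with 0 log 0 = 0."""
    pos = theta > 0
    entropy_term = np.sum(theta[pos] * np.log(theta[pos]))
    return float(theta @ risks + (T / n) * entropy_term)

rng = np.random.default_rng(1)
risks = rng.uniform(0.5, 2.0, size=6)   # hypothetical empirical risks R_n(f_lambda)
n, T = 50, 4.0
theta_hat = gibbs_weights(risks, n, T)
best = penalized_linear_risk(theta_hat, risks, n, T)

# Compare against random competitors drawn from the simplex.
others = rng.dirichlet(np.ones(6), size=2000)
gaps = np.array([penalized_linear_risk(t, risks, n, T) for t in others]) - best
closed_form = -(T / n) * np.log(np.sum(np.exp(-n * risks / T)))
```

As T decreases the weights concentrate on the empirical risk minimizer, and as T grows they approach the uniform weights.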
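Since

$$\sum_{\lambda \in \Lambda} \hat\theta_\lambda \log\Big( \frac{\hat\theta_\lambda}{1/M} \Big) = K(\hat\theta \mid u) \ge 0,$$

where $K(\theta \mid u)$ denotes the Kullback-Leibler divergence between the weights $\theta$ and the uniform weights $u := (1/|\Lambda|)_{\lambda \in \Lambda}$, we have

$$\mathrm{R}_n(\hat\theta) \le \mathrm{R}_n(\hat\theta) + \frac{T}{n} K(\hat\theta \mid u)$$
$$= \mathrm{R}_n(\hat\theta) + \frac{T}{n} \sum_{\lambda \in \Lambda} \hat\theta_\lambda \log \hat\theta_\lambda + \frac{T \log M}{n}$$
$$\le \mathrm{R}_n(e_{\hat\lambda}) + \frac{T \log M}{n} = R_n(f_{\hat\lambda}) + \frac{T \log M}{n},$$

where $e_{\hat\lambda} \in \Theta$ is the vector with 1 for the $\hat\lambda$-th coordinate and 0 elsewhere. This gives

$$R(\hat f) - R(f_0) \le (1 + a) \min_{\lambda \in \Lambda} (R_n(f_\lambda) - R_n(f_0)) + (1 + a) \frac{T \log M}{n} + \mathrm{R}(\hat\theta) - R(f_0) - (1 + a)(\mathrm{R}_n(\hat\theta) - R_n(f_0)),$$

and consequently

$$E\|\hat f - f_0\|^2 \le (1 + a) \min_{\lambda \in \Lambda} \|f_\lambda - f_0\|^2 + (1 + a) \frac{T \log M}{n} + E[\mathrm{R}(\hat\theta) - R(f_0) - (1 + a)(\mathrm{R}_n(\hat\theta) - R_n(f_0))].$$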
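Since $\mathrm{R}(\cdot)$ and $\mathrm{R}_n(\cdot)$ are linear on $\Theta$, we have

$$\mathrm{R}(\hat\theta) - R(f_0) - (1 + a)(\mathrm{R}_n(\hat\theta) - R_n(f_0)) \le \max_{f \in F(\Lambda)} \big( R(f) - R(f_0) - (1 + a)(R_n(f) - R_n(f_0)) \big).$$

Thus, we have

$$E\|\hat f - f_0\|^2 \le (1 + a) \min_{\lambda \in \Lambda} \|f_\lambda - f_0\|^2 + (1 + a) \frac{T \log M}{n} + \mathcal{R}_n, \quad (6.18)$$

where $\mathcal{R}_n := E[\max_{f \in F(\Lambda)} \{R(f) - R(f_0) - (1 + a)(R_n(f) - R_n(f_0))\}]$. Now, we upper bound $\mathcal{R}_n$. Introduce the random variables

$$Z_i(f) := (f(X_i) - f_0(X_i))^2 + 2\sigma \varepsilon_i 1(|\varepsilon_i| \le K)(f_0(X_i) - f(X_i)),$$
$$\tilde Z_i(f) := 2\sigma \varepsilon_i 1(|\varepsilon_i| > K)(f_0(X_i) - f(X_i)),$$

and the two following processes indexed by $f \in F(\Lambda)$:

$$\zeta(f) := \frac{1}{n} \sum_{i=1}^n \big( E[Z_i(f)] - (1 + a) Z_i(f) \big) \quad \text{and} \quad \tilde\zeta(f) := \frac{1 + a}{n} \sum_{i=1}^n \tilde Z_i(f).$$

We use the symmetry of $\varepsilon$ to get

$$\mathcal{R}_n \le E\Big[ \max_{f \in F(\Lambda)} \zeta(f) \Big] + E\Big[ \max_{f \in F(\Lambda)} \tilde\zeta(f) \Big].$$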

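First, we upper bound $E[\max_{f \in F(\Lambda)} \zeta(f)]$. The random variable $\zeta(f)$ is bounded and satisfies the following Bernstein's type condition (see [3]): $\forall f \in F(\Lambda)$, $E[\zeta(f)^2] \le (Q^2 + 4\sigma^2) E[\zeta(f)]$. We apply the union bound and Bernstein's inequality (cf. [45]) to get, for any $\delta > 0$,

$$P\Big[ \max_{f \in F(\Lambda)} \zeta(f) > \delta \Big] \le M \exp(-C n \delta),$$

where $C := a[8(Q^2 + \sigma^2)(1 + a)^2 + (4Q/3)(1 + a)(Q + 2\sigma K)]^{-1}$. Hence, a direct computation gives

$$E\Big[ \max_{f \in F(\Lambda)} \zeta(f) \Big] \le \frac{4 \log M}{C n}. \quad (6.19)$$

Now, we upper bound $E[\max_{f \in F(\Lambda)} \tilde\zeta(f)]$. We have

$$E\Big[ \max_{f \in F(\Lambda)} \tilde\zeta(f) \Big] \le 4Q(1 + a) E[|\varepsilon| 1(|\varepsilon| > K)]$$
$$\le 4Q(1 + a) \sigma P(|\varepsilon| > K)^{1/2} \quad (6.20)$$
$$\le 4Q(1 + a) \sigma \exp(-K^2/(2 b_\varepsilon^2)). \quad (6.21)$$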
Finally, combining equations (6.18), (6.19) and (6.20) with $K = b_\varepsilon \sqrt{2 \log n}$ concludes the proof of Theorem 4. □

7. Proofs of the lemmas. 

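Proof of Lemma 1. Since $\beta \in (0, 2)$ we have $\alpha > 2\beta/(\beta + 2) > \beta/2$. Thus, inequality (6.5) gives

$$\log(r^2 + h^2 I^\alpha) \le \log(\varepsilon) + \Big(1 - \frac{\beta}{2}\Big) \log(r) - \Big(1 - \frac{\beta}{2\alpha}\Big) \log(r^2) - \frac{\beta}{\alpha} \log(h) + \Big(1 - \frac{\beta}{2\alpha}\Big) \log(r^2) + \frac{\beta}{2\alpha} \log(h^2 I^\alpha)$$
$$\le \log(\varepsilon) + \Big(\frac{\beta}{\alpha} - 1 - \frac{\beta}{2}\Big) \log(r) - \frac{\beta}{\alpha} \log(h) + \log(r^2 + h^2 I^\alpha),$$

and consequently

$$r^{1 + \beta/2 - \beta/\alpha} \le \varepsilon h^{-\beta/\alpha},$$

which entails $r \le (\varepsilon^\alpha h^{-\beta})^{2/(2\alpha + \alpha\beta - 2\beta)}$. Now, using this inequality together with $h^2 I^\alpha \le \varepsilon r^{1-\beta/2} I^{\beta/2}$ provides the upper bound for $I$. The last inequality easily follows. □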

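Proof of Lemma 2. [The proof consists of a peeling of $\mathcal{F}_Q$ into subspaces with complexity controlled by Assumption $(C_\beta)$ and the use of Bernstein's inequality.] Let us denote for short $\mathcal{F}$ instead of $\mathcal{F}_Q$. Since $\bar f \in \mathcal{F}$ we have

$$P[\|\bar f - f_0\|^2 - 8(\|\bar f - f_0\|_n^2 + \mathrm{pen}(\bar f)) > 10 z h^2]$$
$$\le P[\exists f \in \mathcal{F} : \|f - f_0\|^2 - 8(\|f - f_0\|_n^2 + \mathrm{pen}(f)) > 10 z h^2]$$
$$\le P[A_1] + \sum_{k \ge 2} P[A_k]$$

where

$$A_1 := \{\exists f \in \mathcal{F}, \ \mathrm{pen}(f) \le 2^{\alpha/\beta} h^2 : \|f - f_0\|^2 - 8(\|f - f_0\|_n^2 + \mathrm{pen}(f)) > 10 z h^2\}$$

and for $k \ge 2$,

$$A_k := \{\exists f \in \mathcal{F}, \ 2^{\alpha(k-1)/\beta} h^2 < \mathrm{pen}(f) \le 2^{\alpha k/\beta} h^2 : \|f - f_0\|^2 - 8(\|f - f_0\|_n^2 + \mathrm{pen}(f)) > 10 z h^2\}.$$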
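Hence, since $z \ge z_0 > 1$ and $\alpha/\beta > 2/(\beta + 2) > 1/2$ since $\beta < 2$, we have $P[A_k] \le P_k$ for any $k \ge 1$, where

$$P_k := P\big[ \exists f \in \mathcal{F}, \ \mathrm{pen}(f) \le 2^{\alpha k/\beta} h^2 : \|f - f_0\|^2 - 8\|f - f_0\|_n^2 > 2 z h^2 + 4 \cdot 2^{\alpha k/\beta} h^2 \big].$$

Now, let $F(\delta, k)$ be a minimal $\delta$-covering for the norm $\|\cdot\|_\infty$ of the set

$$\{f \in \mathcal{F} : \mathrm{pen}(f) \le 2^{\alpha k/\beta} h^2\} = \{f \in \mathcal{F} : |f|_{\mathcal{F}} \le 2^{k/\beta}\},$$

where we recall that $\mathrm{pen}(f) = h^2 |f|_{\mathcal{F}}^\alpha$. Assumption $(C_\beta)$ entails

$$|F(\delta, k)| \le \exp(D 2^k \delta^{-\beta}). \quad (7.1)$$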

Since for any /i, /2 G such that ||/i — /2II00 < ^5 we have 

ll/i - /of < 211/2 - /of + 25^ and 2||/i - /of > 2||/2 - /of - 25^^ 

we obtain 

Pk <P[3fe F{5, k):2\\f- /of - 4||/ - /of + 6<52 > 2zh^ + 42"'^/'3/i2] 
< ^ xP[||/-/of -||/-/of >t,(z)], 

f&F{S,k) 

where $t_k(z) := z h^2/2 + 2^{\alpha k/\beta} h^2 - 3\delta^2/2 + \|f - f_0\|^2/2$. Let $f \in F(\delta, k)$ be fixed. We introduce the random variables $U_i := (f(X_i) - f_0(X_i))^2$, so that $\|f - f_0\|_n^2 = \sum_{i=1}^n U_i/n$ and $E[U_i] = \|f - f_0\|^2$. Note that the $U_i$ are independent, such that $0 \le U_i \le Q^2$, and $\mathrm{Var}[U_i] \le E[U_i^2] \le Q^2 E[U_i] = Q^2 \|f - f_0\|^2$. Hence, if $t_k(z) \ge \|f - f_0\|^2/2$, Bernstein's inequality entails

n 

P[\\f - /of - 11/ - /of > tkiz)] = P[J2{U, - E[Ui]) > ntkiz) 

i=l 

^ ( -ntk{zf \ 

-''''^y2m\f-h\?+QHu{z)/2,)) 

^ ( 280^ ) • 



By taking $\delta := (2^{\alpha k/\beta} h^2/3)^{1/2}$, we have $t_k(z) \ge \|f - f_0\|^2/2$ and (7.1) becomes

$$|F(\delta, k)| \le \exp\big( D_1 n h^2 2^{k(1 - \alpha/2)} \big)$$

where we used (2.7) and took $D_1 := D 3^{\beta/2}/a^{\beta+2}$. Hence, for $D_2 := 3/(28 Q^2)$, we have

$$P_k \le \exp\big( D_1 n h^2 2^{k(1 - \alpha/2)} - D_2 n h^2 (z + 2^{\alpha k/\beta}) \big).$$

Now, we choose

^ ^ r log(min(Zj2M,l)/2) l ^ 
L(l -a/2-a//3)log2J 

where $[x]$ is the integer part of $x$, and where we recall that $\alpha > 2\beta/(\beta + 2)$, so that $1 - \alpha/2 - \alpha/\beta < 0$. The conclusion of the proof follows easily by the decomposition $\sum_{k \ge 1} P_k = \sum_{1 \le k \le K} P_k + \sum_{k > K} P_k$, if $z \ge z_1$ for the choice
zi := 2(2^°/^ - Z)i2^(i-°/2) /D2). □ 

APPENDIX A: FUNCTION SPACES 

In this section we give precise definitions of the spaces of functions considered in the paper, and give useful related results. The definitions and results presented here can be found in [39], in particular in Chapter 5, which is about anisotropic spaces, anisotropic multiresolutions, and entropy numbers of the embeddings of such spaces (see Section 5.3.3), which we use in particular to derive condition $(C_\beta)$ for the anisotropic Besov space, see Section 2.

A.1. Anisotropic Besov space. Let $\{e_1, \ldots, e_d\}$ be the canonical basis of $\mathbb{R}^d$ and $\mathbf{s} = (s_1, \ldots, s_d)$ with $s_i > 0$ be a vector of directional smoothness, where $s_i$ corresponds to the smoothness in direction $e_i$. Let us fix $1 \le p, q \le \infty$. If $f$ is a function on $\mathbb{R}^d$, we define $\Delta_h^k f$ as the difference of order $k \ge 1$ and step $h \in \mathbb{R}^d$, given by $\Delta_h^1 f(x) = f(x + h) - f(x)$ and $\Delta_h^k f(x) = \Delta_h^1(\Delta_h^{k-1} f)(x)$ for any $x \in \mathbb{R}^d$. We say that $f \in L^p(\mathbb{R}^d)$ belongs to the anisotropic Besov space $B^{\mathbf{s}}_{p,q}(\mathbb{R}^d)$ if the semi-norm

i=i -"^ 

is finite (with the usual modifications when $p = \infty$ or $q = \infty$). We know that the norms

are equivalent for any choice of $k_i > s_i$. An equivalent definition of the seminorm can be given using the directional differences and the anisotropic distance, see Theorem 5.8 in [39]. Following Section 5.3.3 in [39], we can define the anisotropic Besov space on an arbitrary domain $\Omega \subset \mathbb{R}^d$ (think of $\Omega$ as the support of the design $X$) in the following way. We define $B^{\mathbf{s}}_{p,q}(\Omega)$ as the set of all $f \in L^p(\Omega)$ such that there is $g \in B^{\mathbf{s}}_{p,q}(\mathbb{R}^d)$ with restriction $g|_\Omega$ to $\Omega$ equal to $f$ in $L^p(\Omega)$. Moreover,

$$\|f\|_{B^{\mathbf{s}}_{p,q}(\Omega)} := \inf \|g\|_{B^{\mathbf{s}}_{p,q}(\mathbb{R}^d)},$$

where the infimum is taken over all $g \in B^{\mathbf{s}}_{p,q}(\mathbb{R}^d)$ such that $g|_\Omega = f$. In an equivalent way, the space $B^{\mathbf{s}}_{p,q}(\Omega)$ can be defined using intrinsic characterisations by differences, see Section 4.1.4 in [39], where the idea is, roughly, to restrict the increments $h$ in the differences $\Delta_h^k$ so that the support of $\Delta_h^k f$ is included in $\Omega$.

In what follows, we shall remove from the notations the dependence on $\Omega$, since it does not affect the definitions and results below. Moreover, for what we need in this paper, we shall simply take $\Omega$ as the support of the design $X$. Several explicit particular cases for the space $B^{\mathbf{s}}_{p,q}$ are of interest. If $\mathbf{s} = (s, \ldots, s)$ for some $s > 0$, then $B^{\mathbf{s}}_{p,q}$ is the standard isotropic Besov space. When $p = q = 2$ and $\mathbf{s} = (s_1, \ldots, s_d)$ has integer coordinates, $B^{\mathbf{s}}_{2,2}$ is the anisotropic Sobolev space



BI2 



1=1 * 



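If $\mathbf{s}$ has non-integer coordinates, then $B^{\mathbf{s}}_{2,2}$ is the anisotropic Bessel-potential space

$$H^{\mathbf{s}} = \Big\{ f \in L^2 : \sum_{i=1}^d \big\| (1 + |\xi_i|^2)^{s_i/2} \hat f(\xi) \big\|_2 < \infty \Big\}.$$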

The results described in the next section are direct consequences of the transference method, see Section 5.3 in [39]. Roughly, the idea is to transfer problems for anisotropic spaces via sequence spaces (one can think of sequences of wavelet coefficients for instance) to isotropic spaces. This technique makes it possible to prove the statements below. Note that another technique of proof based on replicant coding can be used, see [26]. This is commented on below.

A.2. Embeddings and entropy numbers. Let us first mention the following obvious embedding, which is useful for the proof of the adaptive upper bound (see Section 4.2). If $0 \le \mathbf{s}_1 \le \mathbf{s}_0$ coordinatewise, that is $0 \le s_{1,i} \le s_{0,i}$ for any $i \in \{1, \ldots, d\}$, we have

$$B^{\mathbf{s}_0}_{p,q} \subset B^{\mathbf{s}_1}_{p,q}. \quad (A.1)$$

This simply follows from the fact that $B^{\mathbf{s}}_{p,q} = \bigcap_{i=1}^d B^{s_i}_{p,q,i}$, where $B^{s_i}_{p,q,i}$ is the corresponding Besov space in the $i$-th direction of coordinates, with norm extended to the other $d - 1$ directions (see Remark 5.7 in [39]) together with the standard embedding for the isotropic Besov space.

As we mentioned above, Assumption $(C_\beta)$ (see Section 2) is satisfied for nearly all smoothness spaces considered in the nonparametric literature. In particular, if $\mathcal{F} = B^{\mathbf{s}}_{p,q}$ is the anisotropic Besov space defined above, $(C_\beta)$ is satisfied: it is a consequence of a more general Theorem (see Theorem 5.30 in [39]) concerning the entropy numbers of embeddings (see Definition 1.87 in [39]). Here, we only give a simplified version of this Theorem, which is sufficient to derive $(C_\beta)$. Indeed, if one takes $\mathbf{s}_0 = \mathbf{s}$, $p_0 = p$, $q_0 = q$ and $\mathbf{s}_1 = 0$, $p_1 = \infty$, $q_1 = \infty$ in Theorem 5.30 from [39], we obtain the following

Theorem 5. Let $1 \le p, q \le \infty$ and $\mathbf{s} = (s_1, \ldots, s_d)$ where $s_i > 0$, and let $s$ be the harmonic mean of $\mathbf{s}$ (see (2.3)). Whenever $s > d/p$, we have

$$B^{\mathbf{s}}_{p,q} \subset C(\Omega),$$

where $C(\Omega)$ is the set of continuous functions on $\Omega$, and for any $\delta > 0$, the sup-norm entropy of the unit ball of the anisotropic Besov space, namely the set

$$U^{\mathbf{s}}_{p,q} := \{ f \in B^{\mathbf{s}}_{p,q} : \|f\|_{B^{\mathbf{s}}_{p,q}} \le 1 \},$$

satisfies

$$H_\infty(\delta, U^{\mathbf{s}}_{p,q}) \le D \delta^{-d/s}, \quad (A.2)$$

where $D > 0$ is a constant independent of $\delta$.

For the isotropic Sobolev space, Theorem 5 was obtained in the key paper [6] (see Theorem 5.2 therein), and for the isotropic Besov space, it can be found, among others, in [5] and [26].

Remark. A more constructive computation of the entropy of anisotropic 
Besov spaces can be done using the replicant coding approach, which is done 
for Besov bodies in [26]. Using this approach together with an anisotropic 
multiresolution analysis based on compactly supported wavelets or atoms, 
see Section 5.2 in [39], we can obtain a direct computation of the entropy. 
The idea is to do a quantization of the wavelet coefficients, and then to code 
them using a replication of their binary representation, and to use 01 as a 
separator (so that the coding is injective). A lower bound for the entropy can 
be obtained as an elegant consequence of Hoeffding's deviation inequality 
for sums of i.i.d. variables and a combinatorial lemma. 
APPENDIX B: SOME PROBABILISTIC TOOLS

For the first Theorem we refer to [12]. The two following Theorems can 
be found, for instance, in [34, 36, 45]. 

Theorem 6 (Einmahl and Mason (1996)). Let $Z_1, \ldots, Z_n$ be $n$ independent non-negative random variables such that $E[Z_i^2] \le a^2$, $\forall i = 1, \ldots, n$. Then, we have, for any $\delta > 0$,

$$P\Big[ \sum_{i=1}^n (Z_i - E[Z_i]) \le -n\delta \Big] \le \exp\Big( -\frac{n \delta^2}{2 a^2} \Big).$$

Theorem 7 (Sudakov). There exists an absolute constant $c^* > 0$ such that for any integer $M$, any centered Gaussian vector $X = (X_1, \ldots, X_M)$ in $\mathbb{R}^M$, we have

$$c^* E\Big[ \max_{1 \le j \le M} X_j \Big] \ge \epsilon \sqrt{\log M},$$

where $\epsilon := \min\{ E[(X_i - X_j)^2]^{1/2} : i \ne j \in \{1, \ldots, M\} \}$.

Theorem 8 (Maximal inequality). Let $Y_1, \ldots, Y_M$ be $M$ random variables satisfying $E[\exp(s Y_j)] \le \exp((s^2 \sigma^2)/2)$ for any integer $j$ and any $s > 0$. Then, we have

$$E\Big[ \max_{1 \le j \le M} Y_j \Big] \le \sigma \sqrt{\log M}.$$